diff --git a/Optimus/.gitignore b/Optimus/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..02d5161d63cf4f90e5e1b1d46f7b6990a58fed5d --- /dev/null +++ b/Optimus/.gitignore @@ -0,0 +1,8 @@ +data/datasets/glue_data/glue_data +data/datasets/glue_data/train.tx +data/datasets/glue_data/cached_lm_gpt_bert_256_train.jsont +code/runs +output/* +code/pytorch_transformers/__pycache__/* +code/examples/big_ae/modules/encoders/__pycache__/* + diff --git a/Optimus/README.md b/Optimus/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3eba2fd7beb484ea779b366d2828b325b92d7c82 --- /dev/null +++ b/Optimus/README.md @@ -0,0 +1,121 @@ +# Optimus: the first pre-trained Big VAE language model + +This repository contains source code necessary to reproduce the results presented in the EMNLP 2020 paper [Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space](https://arxiv.org/abs/2004.04092). + + +| | +|-------------------------|:-------------------------:| +| The network architecture of Optimus: encoder for representation learning and decoder for generation | Sentences are organized and manipulated in a pre-trained compact and smooth latent space + + +For more on this project, see the [Microsoft Research Blog post](https://www.microsoft.com/en-us/research/blog/a-deep-generative-model-trifecta-three-advances-that-work-towards-harnessing-large-scale-power/). + + +## News + +May 21, 2020: Releasing a [`demo`](http://40.71.23.172:8899/) for latent space manipulation, including sentence interpolation and analogy. Check out the [`website`](http://40.71.23.172:8899/). + +May 20, 2020: The latent space manipulation code is cleaned and released. See instructions at [`optimius_for_snli.md`](doc/optimius_for_snli.md). + +May 13, 2020: The fine-tuning code for langauge modeling is released. See instructions at [`optimus_finetune_language_models.md`](doc/optimus_finetune_language_models.md) + +## Contents +There are four steps to use this codebase to reproduce the results in the paper. + +1. [Dependencies](#dependencies) +2. [Prepare datasets](#prepare-datasets) +3. [Model training](#Model-training) + 1. Pre-training on setences in Wikipedia + 2. Languange Modeling + 3. Guided Language Generation + 4. Low-resource Language Understanding +4. [Collect and plot results](#collect-and-plot-results) + + +## Dependencies + +Pull docker from Docker Hub at: `chunyl/pytorch-transformers:v2`. Please see the instruction at [`doc/env.md`](doc/env.md) + +The project is organized into the following structures, with ensential files & folders visualized. `output` saves the models checkpoints. +``` +├── Optimus +   └── code +   ├── examples +           ├── big_ae + ├── modules + ├── vae.py + └── ... + ├── run_lm_vae_pretraining_phdist_beta.py + ├── run_lm_vae_training.py + └── ... + ├── pytorch_transformers + ├── modeling_bert.py + ├── modeling_gpt2.py + └── ... +   ├── scripts + ├── scripts_docker + ├── scripts_local + ├── scripts_philly +   └── data + └── datasets + ├── wikipedia_json_64_filtered + └── ... + ├── snli_data + └── ... +   └── output + ├── pretrain + ├── LM + └── ... +``` + +## Prepare Datasets + +Please download or preparation the data via following the instructions at [`data/download_datasets.md`](data/download_datasets.md). + +## Model Training + +**1. Pre-training on setences in Wikipedia** + +We pre-trained our models on Philly (a Microsoft internal compute cluster), the code is specialized for multi-node multi-GPU compute on this platform. 
The main pre-training script is [`run_lm_vae_pretraining_phdist_beta.py`](code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py). You may need to adjust the distributed training scripts for your own cluster.
+
+**2. Language Modeling**
+
+To have a fair comparison with existing VAE language models, we consider a model with latent dimension 32. The pre-trained model is fine-tuned on four commonly used datasets for one epoch. Please see the details at [`doc/optimus_finetune_language_models.md`](doc/optimus_finetune_language_models.md).
+
+**3. Guided Language Generation**
+
+**Latent Space Manipulation** To ensure good performance, we consider a model with latent dimension 768. The pre-trained model is fine-tuned on the SNLI dataset, where sentences show related patterns. Please see the details at [`doc/optimius_for_snli.md`](doc/optimius_for_snli.md).
+
+**4. Low-resource Language Understanding**
+
+## Collect and Plot Results
+
+Once the networks are trained and the results are saved, we extract key results using a Python script. The results can be plotted using the included IPython notebook `plots/main_plots.ipynb`.
+Start the IPython Notebook server:
+
+```
+$ cd plots
+$ ipython notebook
+```
+
+Select the `main_plots.ipynb` notebook and execute the included code. Note that, without modification, the notebook already contains our extracted results, and the script will output the figures in the paper. If you've run your own training and wish to plot results, you'll have to organize your results in the same format.
+
+
+## Questions?
+
+Please drop me ([Chunyuan](http://chunyuan.li/)) a line if you have any questions.
+
+
+```
+@inproceedings{li2020_Optimus,
+  title={Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space},
+  author={Li, Chunyuan and Gao, Xiang and Li, Yuan and Li, Xiujun and Peng, Baolin and Zhang, Yizhe and Gao, Jianfeng},
+  booktitle={EMNLP},
+  year={2020}
+}
+```
+
diff --git a/Optimus/code/README.md b/Optimus/code/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..47008b6b3fee8c124fb9d4a8f929a3a5d20415a2
--- /dev/null
+++ b/Optimus/code/README.md
@@ -0,0 +1,41 @@
+## Set up Environment
+
+Pull the docker image from Docker Hub: `chunyl/pytorch-transformers:v2`
+
+Edit the project path to the absolute path on your machine by changing the "SCRIPTPATH" in [run_docker.sh](./scripts/scripts_docker/run_docker.sh)
+
+From this directory ("code"), run docker:
+
+    sh scripts/scripts_docker/run_docker.sh
+
+
+## Fine-tune Language Models
+
+    sh scripts/scripts_local/run_ft_lm_vae_optimus.sh
+
+
+The main training script is [`run_lm_vae_training.py`](./examples/big_ae/run_lm_vae_training.py). It conducts the fine-tuning loop and takes the following options (among others) as arguments:
+
+- `--checkpoint_dir`: the folder where the pre-trained Optimus checkpoint is saved.
+- `--gloabl_step_eval`: specifies which checkpoint to load (the training step at which Optimus was saved).
+- `--train_data_file` and `--eval_data_file`: the paths to the training and test datasets for the downstream fine-tuning.
+- `--dataset`: the dataset for fine-tuning, such as `Penn`.
+- `--num_train_epochs`: number of training epochs (type=int); default 1.
+- `--dim_target_kl`: the hyper-parameter for dimension-wise thresholding used in fine-tuning (type=float); default 0.5.
+- `--beta`: the maximum beta value in the cyclical annealing schedule used in fine-tuning (type=float); default 1.0.
+- `--ratio_zero`: the proportion of beta=0 in one period for fine-tuning(type=float); default 0.5 +- `--ratio_increase`: the proportion of beta that increases from 0 to the maximum value in one period in cyclical annealing schedule used in fine-tuning(type=float); default 0.25. + + +For more options, please see [`run_lm_vae_training.py`](./examples/big_ae/run_lm_vae_training.py) and see the examples we provided in [`run_ft_lm_vae_optimus.sh`](./scripts/scripts_local/run_ft_lm_vae_optimus.sh), or [more running scripts we used to run the code on a cluster](./scripts/scripts_philly). + + +## Play with the latent space + + sh scripts/scripts_local/eval_optimus_latent_space.sh + +The main training script is [`run_latent_generation.py`](./examples/big_ae/run_latent_generation.py) and evaluates the various ways to generate text conditioned on latent vectors, taking the following options (among others) as arguments: + +- `--play_mode`: The current scripts supports two ways to play with the pre-trained VAE models: [`reconstrction`, `interpolation`] diff --git a/Optimus/code/app.py b/Optimus/code/app.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/Optimus/code/examples/README.md b/Optimus/code/examples/README.md new file mode 100755 index 0000000000000000000000000000000000000000..a41c117078a63aa88ffa32dd52525ca12bf1124d --- /dev/null +++ b/Optimus/code/examples/README.md @@ -0,0 +1,392 @@ +# Examples + +In this section a few examples are put together. All of these examples work for several models, making use of the very +similar API between the different models. + +| Section | Description | +|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. | +| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. | +| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. | +| [SQuAD](#squad) | Using BERT for question answering, examples with distributed training. | +| [Multiple Choice](#multiple choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. + +## Language model fine-tuning + +Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_lm_finetuning.py). + +Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT +to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa +are fine-tuned using a masked language modeling (MLM) loss. + +Before running the following example, you should get a file that contains text on which the language model will be +fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/). + +We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains +text that will be used for evaluation. 
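+To make the CLM vs. MLM distinction above concrete, here is a minimal, self-contained PyTorch sketch (not taken from `run_lm_finetuning.py`; the vocabulary size, token ids, and masked positions are toy values): a causal LM is scored on predicting each next token, while a masked LM is scored only on the positions that were masked out.
+
+```python
+import torch
+import torch.nn.functional as F
+
+vocab_size = 50
+tokens = torch.randint(0, vocab_size, (1, 8))   # a toy batch: one sequence of 8 token ids
+logits = torch.randn(1, 8, vocab_size)          # stand-in for the model's output logits
+
+# Causal LM (GPT/GPT-2): predict token t+1 from positions <= t.
+clm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
+                           tokens[:, 1:].reshape(-1))
+
+# Masked LM (BERT/RoBERTa): score only a few masked positions; in the real script
+# those input positions would also be replaced by a [MASK] token.
+labels = torch.full_like(tokens, -100)          # -100 = ignored by cross_entropy
+labels[0, [2, 5]] = tokens[0, [2, 5]]           # the two "masked" positions to predict
+mlm_loss = F.cross_entropy(logits.reshape(-1, vocab_size),
+                           labels.reshape(-1), ignore_index=-100)
+print(clm_loss.item(), mlm_loss.item())
+```
+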
+ +### GPT-2/GPT and causal language modeling + +The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before +the tokenization). The loss here is that of causal language modeling. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_lm_finetuning.py \ + --output_dir=output \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE +``` + +This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches +a score of ~20 perplexity once fine-tuned on the dataset. + +### RoBERTa/BERT and masked language modeling + +The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different +as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their +pre-training: masked language modeling. + +In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge +slightly slower (over-fitting takes more epochs). + +We use the `--mlm` flag so that the script may change its loss function. + +```bash +export TRAIN_FILE=/path/to/dataset/wiki.train.raw +export TEST_FILE=/path/to/dataset/wiki.test.raw + +python run_lm_finetuning.py \ + --output_dir=output \ + --model_type=roberta \ + --model_name_or_path=roberta-base \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --mlm +``` + +## Language generation + +Based on the script [`run_generation.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py). + +Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. +A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you +can try out the different models available in the library. + +Example usage: + +```bash +python run_generation.py \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 +``` + +## GLUE + +Based on the script [`run_glue.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py). + +Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding +Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa. + +GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an +uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train +batch size of 24. Some of these tasks have a small dataset and training can lead to high variance in the results +between different runs. We report the median on 5 runs (with different seeds) for each of the metrics. + +| Task | Metric | Result | +|-------|------------------------------|-------------| +| CoLA | Matthew's corr | 48.87 | +| SST-2 | Accuracy | 91.74 | +| MRPC | F1/Accuracy | 90.70/86.27 | +| STS-B | Person/Spearman corr. | 91.39/91.04 | +| QQP | Accuracy/F1 | 90.79/87.66 | +| MNLI | Matched acc./Mismatched acc. 
| 83.70/84.83 |
+| QNLI  | Accuracy                     | 89.31       |
+| RTE   | Accuracy                     | 71.43       |
+| WNLI  | Accuracy                     | 43.66       |
+
+Some of these results are significantly different from the ones reported on the test set of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
+
+Before running any of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC
+
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
+
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks, since the data processor for each task inherits from the base class DataProcessor.
+
+### MRPC
+
+#### Fine-tuning example
+
+The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less than 10 minutes on a single K80, and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
+
+Before running any of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --task_name MRPC \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/MRPC/ \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/mrpc_output/
+```
+
+Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation results between 84% and 88%.
+
+#### Using Apex and mixed-precision
+
+Using Apex and 16-bit precision, the fine-tuning on MRPC only takes 27 seconds.
First install +[apex](https://github.com/NVIDIA/apex), then run the following example: + +```bash +export GLUE_DIR=/path/to/glue + +python run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --do_lower_case \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ \ + --fp16 +``` + +#### Distributed training + +Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it +reaches F1 > 92 on MRPC. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name MRPC \ + --do_train \ + --do_eval \ + --do_lower_case \ + --data_dir $GLUE_DIR/MRPC/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir /tmp/mrpc_output/ +``` + +Training with these hyper-parameters gave us the following results: + +```bash +acc = 0.8823529411764706 +acc_and_f1 = 0.901702786377709 +eval_loss = 0.3418912578906332 +f1 = 0.9210526315789473 +global_step = 174 +loss = 0.07231863956341798 +``` + +### MNLI + +The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task. + +```bash +export GLUE_DIR=/path/to/glue + +python -m torch.distributed.launch \ + --nproc_per_node 8 run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name mnli \ + --do_train \ + --do_eval \ + --do_lower_case \ + --data_dir $GLUE_DIR/MNLI/ \ + --max_seq_length 128 \ + --per_gpu_train_batch_size 8 \ + --learning_rate 2e-5 \ + --num_train_epochs 3.0 \ + --output_dir output_dir \ +``` + +The results are the following: + +```bash +***** Eval results ***** + acc = 0.8679706601466992 + eval_loss = 0.4911287787382479 + global_step = 18408 + loss = 0.04755385363816904 + +***** Eval results ***** + acc = 0.8747965825874695 + eval_loss = 0.45516540421714036 + global_step = 18408 + loss = 0.04755385363816904 +``` + +##Multiple Choice + +Based on the script [`run_multiple_choice.py`](). + +#### Fine-tuning on SWAG +Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data + +``` +#training on 4 tesla V100(16GB) GPUS +export SWAG_DIR=/path/to/swag_data_dir +python ./examples/single_model_scripts/run_multiple_choice.py \ +--model_type roberta \ +--task_name swag \ +--model_name_or_path roberta-base \ +--do_train \ +--do_eval \ +--do_lower_case \ +--data_dir $SWAG_DIR \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--max_seq_length 80 \ +--output_dir models_bert/swag_base \ +--per_gpu_eval_batch_size=16 \ +--per_gpu_train_batch_size=16 \ +--gradient_accumulation_steps 2 \ +--overwrite_output +``` +Training with the defined hyper-parameters yields the following results: +``` +***** Eval results ***** +eval_acc = 0.8338998300509847 +eval_loss = 0.44457291918821606 +``` + +## SQuAD + +Based on the script [`run_squad.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_squad.py). + +#### Fine-tuning on SQuAD + +This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) +on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a +$SQUAD_DIR directory. 
+
+* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
+* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
+* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
+
+```bash
+export SQUAD_DIR=/path/to/SQUAD
+
+python run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --train_file $SQUAD_DIR/train-v1.1.json \
+  --predict_file $SQUAD_DIR/dev-v1.1.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /tmp/debug_squad/
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 88.52
+exact_match = 81.22
+```
+
+#### Distributed training
+
+Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model to reach an F1 > 93 on SQuAD:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --train_file $SQUAD_DIR/train-v1.1.json \
+  --predict_file $SQUAD_DIR/dev-v1.1.json \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir ../models/wwm_uncased_finetuned_squad/ \
+  --per_gpu_train_batch_size 24 \
+  --gradient_accumulation_steps 12
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 93.15
+exact_match = 86.91
+```
+
+This fine-tuned model is available as a checkpoint under the reference `bert-large-uncased-whole-word-masking-finetuned-squad`.
+
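+As a quick sanity check of that checkpoint, here is a minimal usage sketch. It is not part of the repo scripts; it assumes the bundled `pytorch_transformers` package still exposes the standard `BertForQuestionAnswering` head, the question/context strings are toy examples, and `token_type_ids` are omitted for brevity (the real `run_squad.py` builds them).
+
+```python
+import torch
+from pytorch_transformers import BertTokenizer, BertForQuestionAnswering
+
+name = "bert-large-uncased-whole-word-masking-finetuned-squad"
+tokenizer = BertTokenizer.from_pretrained(name)
+model = BertForQuestionAnswering.from_pretrained(name)
+model.eval()
+
+question = "Who wrote Hamlet?"
+context = "Hamlet is a tragedy written by William Shakespeare."
+# Build the usual [CLS] question [SEP] context [SEP] input.
+input_ids = tokenizer.encode("[CLS] " + question + " [SEP] " + context + " [SEP]")
+
+with torch.no_grad():
+    outputs = model(torch.tensor([input_ids]))
+start_logits, end_logits = outputs[0], outputs[1]
+
+# Take the most likely start/end positions and print the answer span.
+start = start_logits.argmax(dim=1).item()
+end = end_logits.argmax(dim=1).item()
+tokens = tokenizer.convert_ids_to_tokens(input_ids[start:end + 1])
+print(" ".join(tokens).replace(" ##", ""))
+```
+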
+ diff --git a/Optimus/code/examples/__pycache__/utils_glue.cpython-37.pyc b/Optimus/code/examples/__pycache__/utils_glue.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..b26db114c23932844fa75a6f6169a1a6e04d49c5 Binary files /dev/null and b/Optimus/code/examples/__pycache__/utils_glue.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/__pycache__/grad_app.cpython-310.pyc b/Optimus/code/examples/big_ae/__pycache__/grad_app.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..a8e5ed308b2acc328f5cb173d1bbd819d3d6a406 Binary files /dev/null and b/Optimus/code/examples/big_ae/__pycache__/grad_app.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/__pycache__/utils.cpython-37.pyc b/Optimus/code/examples/big_ae/__pycache__/utils.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..111b63a83c0c7746d77fe8d22b0f214fb02a42a3 Binary files /dev/null and b/Optimus/code/examples/big_ae/__pycache__/utils.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/debug_data.py b/Optimus/code/examples/big_ae/debug_data.py new file mode 100755 index 0000000000000000000000000000000000000000..28005edf82bbde7ee4fb2edb594fa48d124a4849 --- /dev/null +++ b/Optimus/code/examples/big_ae/debug_data.py @@ -0,0 +1,6 @@ +import torch +import os + +output_dir = "../output/philly_rr1_vae_wikipedia_pretraining_2nd_file" + +data = torch.load(os.path.join(output_dir, 'batch_debug_6621.pt') \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/eval_dialog_multi_response.py b/Optimus/code/examples/big_ae/eval_dialog_multi_response.py new file mode 100755 index 0000000000000000000000000000000000000000..8ff9554c78c1c8f0a9f74c017f78098134843893 --- /dev/null +++ b/Optimus/code/examples/big_ae/eval_dialog_multi_response.py @@ -0,0 +1,378 @@ +import numpy as np +import torch +import torch.nn.functional as F +from nltk.translate.bleu_score import sentence_bleu +from nltk.translate.bleu_score import SmoothingFunction +from sklearn.metrics.pairwise import cosine_similarity as cosine +from collections import Counter +import os, pickle, pdb + +class Metrics: + # based on https://raw.githubusercontent.com/guxd/DialogWAE/29f206af05bfe5fe28fec4448e208310a7c9258d/experiments/metrics.py + + def __init__(self, path_word2vec='../data/datasets/dailydialog_data/glove.twitter.27B.200d.txt'): + """ + :param word2vec - a numpy array of word2vec with shape [vocab_size x emb_size] + """ + super(Metrics, self).__init__() + self.load_word2vec(path_word2vec) + #self.word2vec = dict() + + def load_word2vec(self, path_word2vec): + path_pkl = path_word2vec + '.pkl' + if os.path.exists(path_pkl): + print('loading word2vec from '+path_pkl) + self.word2vec = pickle.load(open(path_pkl, 'rb')) + else: + self.word2vec = dict() + for i, line in enumerate(open(path_word2vec, encoding='utf-8')): + ss = line.strip('\n').split() + self.word2vec[ss[0]] = [float(v) for v in ss[1:]] + if i % 1e4 == 0: + print('processed %ik word2vec'%(i/1e3)) + print('dumping word2vec to '+path_pkl) + pickle.dump(self.word2vec, open(path_pkl, 'wb')) + self.embed_dim = len(list(self.word2vec.values())[0]) + print('loaded %i word2vec of dim %i'%(len(self.word2vec), self.embed_dim)) + + def embedding(self, seqs): + # note: different from original implementation + batch_size, seqlen = seqs.shape + embs = np.zeros([batch_size, seqlen, self.embed_dim]) + for i in range(batch_size): + for j in range(seqlen): + w = seqs[i,j] + if w != '' and w in 
self.word2vec: + embs[i, j, :] = self.word2vec[w] + return embs + + + def extrema(self, embs, lens): # embs: [batch_size x seq_len x emb_size] lens: [batch_size] + """ + computes the value of every single dimension in the word vectors which has the greatest + difference from zero. + :param seq: sequence + :param seqlen: length of sequence + """ + # Find minimum and maximum value for every dimension in predictions + batch_size, seq_len, emb_size = embs.shape + max_mask = np.zeros((batch_size, seq_len, emb_size), dtype=np.int) + for i,length in enumerate(lens): + max_mask[i,:length,:]=1 + min_mask = 1-max_mask + seq_max = (embs*max_mask).max(1) # [batch_sz x emb_sz] + seq_min = (embs+min_mask).min(1) + # Find the maximum absolute value in min and max data + comp_mask = seq_max >= np.abs(seq_min)# [batch_sz x emb_sz] + # Add vectors for finding final sequence representation for predictions + extrema_emb = seq_max* comp_mask + seq_min* np.logical_not(comp_mask) + return extrema_emb + + def mean(self, embs, lens): + batch_size, seq_len, emb_size=embs.shape + mask = np.zeros((batch_size, seq_len, emb_size), dtype=np.int) + for i,length in enumerate(lens): + mask[i,:length,:]=1 + return (embs*mask).sum(1)/(mask.sum(1)+1e-8) + + def sim_bleu(self, hyps, ref): + """ + :param ref - a list of tokens of the reference + :param hyps - a list of tokens of the hypothesis + + :return maxbleu - recall bleu + :return avgbleu - precision bleu + """ + scores = [] + for hyp in hyps: + try: + scores.append(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method7, + weights=[1./3, 1./3, 1./3])) + except: + scores.append(0.0) + return np.max(scores), np.mean(scores) + + + def sim_bow(self, pred, pred_lens, ref, ref_lens): + """ + :param pred - ndarray [batch_size x seqlen] + :param pred_lens - list of integers + :param ref - ndarray [batch_size x seqlen] + """ + # look up word embeddings for prediction and reference + emb_pred = self.embedding(pred) # [batch_sz x seqlen1 x emb_sz] + emb_ref = self.embedding(ref) # [batch_sz x seqlen2 x emb_sz] + + ext_emb_pred=self.extrema(emb_pred, pred_lens) + ext_emb_ref=self.extrema(emb_ref, ref_lens) + bow_extrema=cosine(ext_emb_pred, ext_emb_ref) # [batch_sz_pred x batch_sz_ref] + + avg_emb_pred = self.mean(emb_pred, pred_lens) # Calculate mean over seq + avg_emb_ref = self.mean(emb_ref, ref_lens) + bow_avg = cosine(avg_emb_pred, avg_emb_ref) # [batch_sz_pred x batch_sz_ref] + + + batch_pred, seqlen_pred, emb_size=emb_pred.shape + batch_ref, seqlen_ref, emb_size=emb_ref.shape + cos_sim = cosine(emb_pred.reshape((-1, emb_size)), emb_ref.reshape((-1, emb_size))) # [(batch_sz*seqlen1)x(batch_sz*seqlen2)] + cos_sim = cos_sim.reshape((batch_pred, seqlen_pred, batch_ref, seqlen_ref)) + # Find words with max cosine similarity + max12 = cos_sim.max(1).mean(2) # max over seqlen_pred + max21 = cos_sim.max(3).mean(1) # max over seqlen_ref + bow_greedy=(max12+max21)/2 # [batch_pred x batch_ref(1)] + return np.max(bow_extrema), np.max(bow_avg), np.max(bow_greedy) + + def div_distinct(self, seqs, seq_lens): + """ + distinct-1 distinct-2 metrics for diversity measure proposed + by Li et al. "A Diversity-Promoting Objective Function for Neural Conversation Models" + we counted numbers of distinct unigrams and bigrams in the generated responses + and divide the numbers by total number of unigrams and bigrams. + The two metrics measure how informative and diverse the generated responses are. 
+ High numbers and high ratios mean that there is much content in the generated responses, + and high numbers further indicate that the generated responses are long + """ + batch_size = seqs.shape[0] + intra_dist1, intra_dist2=np.zeros(batch_size), np.zeros(batch_size) + + n_unigrams, n_bigrams, n_unigrams_total , n_bigrams_total = 0. ,0., 0., 0. + unigrams_all, bigrams_all = Counter(), Counter() + for b in range(batch_size): + unigrams= Counter([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])]) + bigrams = Counter([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)]) + intra_dist1[b]=(len(unigrams.items())+1e-12)/(seq_lens[b]+1e-5) + intra_dist2[b]=(len(bigrams.items())+1e-12)/(max(0, seq_lens[b]-1)+1e-5) + + unigrams_all.update([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])]) + bigrams_all.update([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)]) + n_unigrams_total += seq_lens[b] + n_bigrams_total += max(0, seq_lens[b]-1) + + inter_dist1 = (len(unigrams_all.items())+1e-12)/(n_unigrams_total+1e-5) + inter_dist2 = (len(bigrams_all.items())+1e-12)/(n_bigrams_total+1e-5) + return intra_dist1, intra_dist2, inter_dist1, inter_dist2 + +import pdb + +def eval_multi_ref(path, path_multi_ref=None): + """ + based on: https://github.com/guxd/DialogWAE/blob/29f206af05bfe5fe28fec4448e208310a7c9258d/sample.py + path: each line is '\t'.join([src, ref, hyp]) + path_multi_ref: each line is '\t'.join([src, hyp]) + the order of unique src appeared in `path_multi_ref` should be the same as that in `path` + """ + metrics = Metrics() + d_ref = dict() + d_hyp = dict() + src2ix = dict() + ix2src = dict() + ix = 0 + for line in open(path, encoding='utf-8'): + line = line.strip('\n').strip() + if len(line) == 0: + continue + + # pdb.set_trace() + src, ref, hyp = line.split('\t') + #src, ref = line.split('\t'); hyp = ref + src = src.replace(' EOS ',' [SEP] ').strip() + ref = ref.strip().split() + hyp = hyp.strip().split() + if src not in d_ref: + d_ref[src] = ref + d_hyp[src] = [hyp] + src2ix[src] = ix + ix2src[ix] = src + ix += 1 + else: + d_hyp[src].append(hyp) + print('loaded %i src-ref-hyp tuples'%(len(d_ref))) + + def chr_only(s): + ret = '' + for c in s: + if c.isalpha(): + ret += c + return ret + + if path_multi_ref is not None: + set_src4multiref = set() + ix = -1 + d_multi_ref = dict() + for line in open(path_multi_ref, encoding='utf-8'): + line = line.strip('\n').strip() + if len(line) == 0: + continue + src4multiref, ref = line.split('\t')[:2] + src4multiref = src4multiref.replace(' EOS ', ' ').replace(' [SEP] ',' ').strip() + ref = ref.strip().split() + if src4multiref not in set_src4multiref: + set_src4multiref.add(src4multiref) + ix += 1 + src = ix2src[ix] + id_hyp = chr_only(src) + id_multiref = chr_only(src4multiref) + if id_multiref != id_hyp: + print('[ERROR] cannot match src4multiref and src4hyp') + print('src4multiref:', src4multiref) + print('src4hyp:', ix2src[ix]) + # pdb.set_trace() + raise ValueError + d_multi_ref[src] = [ref] + else: + d_multi_ref[src].append(ref) + + n_ref = [len(d_multi_ref[k]) for k in d_multi_ref] + print('loaded %i src with multi-ref, avg n_ref = %.3f'%(len(d_multi_ref), np.mean(n_ref))) + + n_miss = 0 + for src in d_ref: + if src not in d_multi_ref: + n_miss += 1 + print('[WARNING] cannot find multiref for src: '+src) + d_multi_ref[src] = [d_ref[src]] + if n_miss > 5: + raise ValueError + + n = len(d_ref) + print(path) + print('n_src\t%i'%n) + + avg_lens = 0 + maxbleu = 0 + avgbleu = 0 + intra_dist1, intra_dist2, inter_dist1, inter_dist2 = 0,0,0,0 + 
bow_extrema, bow_avg, bow_greedy = 0,0,0 + for src in d_ref: + + # BLEU ---- + + if path_multi_ref is None: + m, a = metrics.sim_bleu(d_hyp[src], d_ref[src]) + else: + n_ref = len(d_multi_ref[src]) + m, a = 0, 0 + for ref in d_multi_ref[src]: + _m, _a = metrics.sim_bleu(d_hyp[src], ref) + m += _m + a += _a + m /= n_ref + a /= n_ref + + maxbleu += m + avgbleu += a + + # diversity ---- + + seq_len = [len(hyp) for hyp in d_hyp[src]] + max_len = max(seq_len) + seqs = [] + for hyp in d_hyp[src]: + padded = hyp + [''] * (max_len - len(hyp)) + seqs.append(np.reshape(padded, [1, -1])) + seqs = np.concatenate(seqs, axis=0) + intra1, intra2, inter1, inter2 = metrics.div_distinct(seqs, seq_len) + intra_dist1 += np.mean(intra1) + intra_dist2 += np.mean(intra2) + inter_dist1 += inter1 + inter_dist2 += inter2 + + avg_lens += np.mean(seq_len) + + # BOW ---- + + def calc_bow(ref): + n_hyp = len(d_hyp[src]) + seqs_ref = np.concatenate([np.reshape(ref, [1,-1])] * n_hyp, axis=0) + seq_len_ref = [len(ref)] * n_hyp + return metrics.sim_bow(seqs, seq_len, seqs_ref, seq_len_ref) + + if path_multi_ref is None: + extrema, avg, greedy = calc_bow(d_ref[src]) + else: + extrema, avg, greedy = 0, 0, 0 + for ref in d_multi_ref[src]: + e, a, g = calc_bow(ref) + extrema += e + avg += a + greedy += g + extrema /= n_ref + avg /= n_ref + greedy /= n_ref + + bow_extrema += extrema + bow_avg += avg + bow_greedy += greedy + + recall_bleu = maxbleu/n + prec_bleu = avgbleu/n + f1 = 2*(prec_bleu*recall_bleu) / (prec_bleu+recall_bleu+10e-12) + + print('BLEU') + print(' R\t%.3f'%recall_bleu) + print(' P\t%.3f'%prec_bleu) + print(' F1\t%.3f'%f1) + print('BOW') + print(' A\t%.3f'%(bow_avg/n)) + print(' E\t%.3f'%(bow_extrema/n)) + print(' G\t%.3f'%(bow_greedy/n)) + print('intra_dist') + print(' 1\t%.3f'%(intra_dist1/n)) + print(' 2\t%.3f'%(intra_dist2/n)) + print('inter_dist') + print(' 1\t%.3f'%(inter_dist1/n)) + print(' 2\t%.3f'%(inter_dist2/n)) + print('avg_L\t%.1f'%(avg_lens/n)) + + results = { + "BLEU_R": recall_bleu, "BLEU_P": prec_bleu, "BLEU_F1": f1, "BOW_A": bow_avg/n, "BOW_E": bow_extrema/n, "BOW_G": bow_greedy/n, "intra_dist1": intra_dist1/n, "intra_dist2": intra_dist2/n, "inter_dist1": inter_dist1/n, "inter_dist2": inter_dist2/n, "avg_L": avg_lens/n + } + + return results + + +def create_rand_baseline(): + path = 'data/datasets/dailydialog_data/test.txt' + srcs = [] + refs = [] + for line in open(path, encoding='utf-8'): + src, ref = line.strip('\n').split('\t') + srcs.append(src.strip()) + refs.append(ref.strip()) + + hyps = set() + path = 'data/datasets/dailydialog_data/train.txt' + for line in open(path, encoding='utf-8'): + _, ref = line.strip('\n').split('\t') + hyps.add(ref) + if len(hyps) == len(srcs) *10: + print('collected training ref') + break + + hyps = list(hyps) + lines = [] + j = 0 + for i in range(len(srcs)): + lines += ['\t'.join([srcs[i], refs[i], hyp]) for hyp in hyps[j:j+10]] + j = j + 10 + with open('out/rand.tsv', 'w', encoding='utf-8') as f: + f.write('\n'.join(lines)) + + +def create_human_baseline(): + path = 'data/datasets/dailydialog_data/test.txt' + lines = [] + for line in open(path, encoding='utf-8'): + src, ref = line.strip('\n').split('\t') + src = src.strip() + ref = ref.strip() + lines.append('\t'.join([src, ref, ref])) + + with open('out/human.tsv', 'w', encoding='utf-8') as f: + f.write('\n'.join(lines)) + + +if __name__ == "__main__": + path = 'D:/data/switchboard/test.txt.1ref' + path_multi_ref = 'D:/data/switchboard/test.txt' + eval_multi_ref(path_multi_ref, path) \ No newline at end 
of file diff --git a/Optimus/code/examples/big_ae/eval_dialog_response.py b/Optimus/code/examples/big_ae/eval_dialog_response.py new file mode 100755 index 0000000000000000000000000000000000000000..9de58b84640cb051274f3c72c9194f01c91f4ac3 --- /dev/null +++ b/Optimus/code/examples/big_ae/eval_dialog_response.py @@ -0,0 +1,295 @@ +import numpy as np +import torch +import torch.nn.functional as F +from nltk.translate.bleu_score import sentence_bleu +from nltk.translate.bleu_score import SmoothingFunction +from sklearn.metrics.pairwise import cosine_similarity as cosine +from collections import Counter +import os, pickle + +class Metrics: + # based on https://raw.githubusercontent.com/guxd/DialogWAE/29f206af05bfe5fe28fec4448e208310a7c9258d/experiments/metrics.py + + def __init__(self, path_word2vec='../data/datasets/dailydialog_data/glove.twitter.27B.200d.txt'): + """ + :param word2vec - a numpy array of word2vec with shape [vocab_size x emb_size] + """ + self.path_word2vec = path_word2vec + super(Metrics, self).__init__() + self.load_word2vec(path_word2vec) + + def load_word2vec(self, path_word2vec): + path_pkl = path_word2vec + '.pkl' + if os.path.exists(path_pkl): + print('loading word2vec from '+path_pkl) + self.word2vec = pickle.load(open(path_pkl, 'rb')) + else: + self.word2vec = dict() + for i, line in enumerate(open(path_word2vec, encoding='utf-8')): + ss = line.strip('\n').split() + self.word2vec[ss[0]] = [float(v) for v in ss[1:]] + if i % 1e4 == 0: + print('processed %ik word2vec'%(i/1e3)) + print('dumping word2vec to '+path_pkl) + pickle.dump(self.word2vec, open(path_pkl, 'wb')) + # pdb.set_trace() + self.embed_dim = len(self.word2vec["."]) # len(self.word2vec.values()[0]) + print('loaded %i word2vec of dim %i'%(len(self.word2vec), self.embed_dim)) + + def embedding(self, seqs): + # note: different from original implementation + batch_size, seqlen = seqs.shape + embs = np.zeros([batch_size, seqlen, self.embed_dim]) + for i in range(batch_size): + for j in range(seqlen): + w = seqs[i,j] + if w != '' and w in self.word2vec: + embs[i, j, :] = self.word2vec[w] + return embs + + + def extrema(self, embs, lens): # embs: [batch_size x seq_len x emb_size] lens: [batch_size] + """ + computes the value of every single dimension in the word vectors which has the greatest + difference from zero. 
+ :param seq: sequence + :param seqlen: length of sequence + """ + # Find minimum and maximum value for every dimension in predictions + batch_size, seq_len, emb_size = embs.shape + max_mask = np.zeros((batch_size, seq_len, emb_size), dtype=np.int) + for i,length in enumerate(lens): + max_mask[i,:length,:]=1 + min_mask = 1-max_mask + seq_max = (embs*max_mask).max(1) # [batch_sz x emb_sz] + seq_min = (embs+min_mask).min(1) + # Find the maximum absolute value in min and max data + comp_mask = seq_max >= np.abs(seq_min)# [batch_sz x emb_sz] + # Add vectors for finding final sequence representation for predictions + extrema_emb = seq_max* comp_mask + seq_min* np.logical_not(comp_mask) + return extrema_emb + + def mean(self, embs, lens): + batch_size, seq_len, emb_size=embs.shape + mask = np.zeros((batch_size, seq_len, emb_size), dtype=np.int) + for i,length in enumerate(lens): + mask[i,:length,:]=1 + return (embs*mask).sum(1)/(mask.sum(1)+1e-8) + + def sim_bleu(self, hyps, ref): + """ + :param ref - a list of tokens of the reference + :param hyps - a list of tokens of the hypothesis + + :return maxbleu - recall bleu + :return avgbleu - precision bleu + """ + scores = [] + for hyp in hyps: + try: + scores.append(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method7, + weights=[1./3, 1./3, 1./3])) + except: + scores.append(0.0) + return np.max(scores), np.mean(scores) + + + def sim_bow(self, pred, pred_lens, ref, ref_lens): + """ + :param pred - ndarray [batch_size x seqlen] + :param pred_lens - list of integers + :param ref - ndarray [batch_size x seqlen] + """ + # look up word embeddings for prediction and reference + emb_pred = self.embedding(pred) # [batch_sz x seqlen1 x emb_sz] + emb_ref = self.embedding(ref) # [batch_sz x seqlen2 x emb_sz] + + ext_emb_pred=self.extrema(emb_pred, pred_lens) + ext_emb_ref=self.extrema(emb_ref, ref_lens) + bow_extrema=cosine(ext_emb_pred, ext_emb_ref) # [batch_sz_pred x batch_sz_ref] + + avg_emb_pred = self.mean(emb_pred, pred_lens) # Calculate mean over seq + avg_emb_ref = self.mean(emb_ref, ref_lens) + bow_avg = cosine(avg_emb_pred, avg_emb_ref) # [batch_sz_pred x batch_sz_ref] + + + batch_pred, seqlen_pred, emb_size=emb_pred.shape + batch_ref, seqlen_ref, emb_size=emb_ref.shape + cos_sim = cosine(emb_pred.reshape((-1, emb_size)), emb_ref.reshape((-1, emb_size))) # [(batch_sz*seqlen1)x(batch_sz*seqlen2)] + cos_sim = cos_sim.reshape((batch_pred, seqlen_pred, batch_ref, seqlen_ref)) + # Find words with max cosine similarity + max12 = cos_sim.max(1).mean(2) # max over seqlen_pred + max21 = cos_sim.max(3).mean(1) # max over seqlen_ref + bow_greedy=(max12+max21)/2 # [batch_pred x batch_ref(1)] + return np.max(bow_extrema), np.max(bow_avg), np.max(bow_greedy) + + def div_distinct(self, seqs, seq_lens): + """ + distinct-1 distinct-2 metrics for diversity measure proposed + by Li et al. "A Diversity-Promoting Objective Function for Neural Conversation Models" + we counted numbers of distinct unigrams and bigrams in the generated responses + and divide the numbers by total number of unigrams and bigrams. + The two metrics measure how informative and diverse the generated responses are. + High numbers and high ratios mean that there is much content in the generated responses, + and high numbers further indicate that the generated responses are long + """ + batch_size = seqs.shape[0] + intra_dist1, intra_dist2=np.zeros(batch_size), np.zeros(batch_size) + + n_unigrams, n_bigrams, n_unigrams_total , n_bigrams_total = 0. ,0., 0., 0. 
+ unigrams_all, bigrams_all = Counter(), Counter() + for b in range(batch_size): + unigrams= Counter([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])]) + bigrams = Counter([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)]) + intra_dist1[b]=(len(unigrams.items())+1e-12)/(seq_lens[b]+1e-5) + intra_dist2[b]=(len(bigrams.items())+1e-12)/(max(0, seq_lens[b]-1)+1e-5) + + unigrams_all.update([tuple(seqs[b,i:i+1]) for i in range(seq_lens[b])]) + bigrams_all.update([tuple(seqs[b,i:i+2]) for i in range(seq_lens[b]-1)]) + n_unigrams_total += seq_lens[b] + n_bigrams_total += max(0, seq_lens[b]-1) + + inter_dist1 = (len(unigrams_all.items())+1e-12)/(n_unigrams_total+1e-5) + inter_dist2 = (len(bigrams_all.items())+1e-12)/(n_bigrams_total+1e-5) + return intra_dist1, intra_dist2, inter_dist1, inter_dist2 + +import pdb + +def eval_dialog_response(generated_text_file_path): + """ + based on: https://github.com/guxd/DialogWAE/blob/29f206af05bfe5fe28fec4448e208310a7c9258d/sample.py + quoted from the DialogWAE paper: https://arxiv.org/pdf/1805.12352.pdf + * "For each test context, we sample 10 responses from the models and compute their BLEU scores" + * "We use Glove vectors" "For each test context, we report the maximum BOW embedding score among the 10 sampled responses." + * "intra-dist as the average of distinct values within each sampled response" + " "inter-dist as the distinct value among all sampled responses." + """ + metrics = Metrics() + d_ref = dict() + d_hyp = dict() + for line in open(generated_text_file_path, encoding='utf-8'): + line = line.strip('\n').strip() + if len(line) == 0: + continue + src, ref, hyp = line.split('\t') + src = src.strip() + ref = ref.strip().split() + hyp = hyp.strip().split() + if src not in d_ref: + d_ref[src] = ref + d_hyp[src] = [hyp] + else: + d_hyp[src].append(hyp) + + n = len(d_ref) + print(generated_text_file_path) + print('n_src\t%i'%n) + + avg_lens = 0 + maxbleu = 0 + avgbleu = 0 + intra_dist1, intra_dist2, inter_dist1, inter_dist2 = 0,0,0,0 + bow_extrema, bow_avg, bow_greedy = 0,0,0 + for src in d_ref: + m, a = metrics.sim_bleu(d_hyp[src], d_ref[src]) + maxbleu += m + avgbleu += a + + seq_len = [len(hyp) for hyp in d_hyp[src]] + max_len = max(seq_len) + seqs = [] + for hyp in d_hyp[src]: + padded = hyp + [''] * (max_len - len(hyp)) + seqs.append(np.reshape(padded, [1, -1])) + seqs = np.concatenate(seqs, axis=0) + intra1, intra2, inter1, inter2 = metrics.div_distinct(seqs, seq_len) + intra_dist1 += np.mean(intra1) + intra_dist2 += np.mean(intra2) + inter_dist1 += inter1 + inter_dist2 += inter2 + + n_hyp = len(d_hyp[src]) + seqs_ref = np.concatenate([np.reshape(d_ref[src], [1,-1])] * n_hyp, axis=0) + seq_len_ref = [len(d_ref[src])] * n_hyp + if metrics.word2vec is not None: + extrema, avg, greedy = metrics.sim_bow(seqs, seq_len, seqs_ref, seq_len_ref) + bow_extrema += extrema + bow_avg += avg + bow_greedy += greedy + + avg_lens += np.mean(seq_len) + + recall_bleu = maxbleu/n + prec_bleu = avgbleu/n + f1 = 2*(prec_bleu*recall_bleu) / (prec_bleu+recall_bleu+10e-12) + + print('BLEU') + print(' R\t%.3f'%recall_bleu) + print(' P\t%.3f'%prec_bleu) + print(' F1\t%.3f'%f1) + print('BOW') + print(' A\t%.3f'%(bow_avg/n)) + print(' E\t%.3f'%(bow_extrema/n)) + print(' G\t%.3f'%(bow_greedy/n)) + print('intra_dist') + print(' 1\t%.3f'%(intra_dist1/n)) + print(' 2\t%.3f'%(intra_dist2/n)) + print('inter_dist') + print(' 1\t%.3f'%(inter_dist1/n)) + print(' 2\t%.3f'%(inter_dist2/n)) + print('avg_L\t%.1f'%(avg_lens/n)) + + results = { + "BLEU_R": recall_bleu, "BLEU_P": 
prec_bleu, "BLEU_F1": f1, "BOW_A": bow_avg/n, "BOW_E": bow_extrema/n, "BOW_G": bow_greedy/n, "intra_dist1": intra_dist1/n, "intra_dist2": intra_dist2/n, "inter_dist1": inter_dist1/n, "inter_dist2": inter_dist2/n, "avg_L": avg_lens/n + } + + return results + + + +def create_rand_baseline(): + path = 'data/datasets/dailydialog_data/test.txt' + srcs = [] + refs = [] + for line in open(path, encoding='utf-8'): + src, ref = line.strip('\n').split('\t') + srcs.append(src.strip()) + refs.append(ref.strip()) + + hyps = set() + path = 'data/datasets/dailydialog_data/train.txt' + for line in open(path, encoding='utf-8'): + _, ref = line.strip('\n').split('\t') + hyps.add(ref) + if len(hyps) == len(srcs) *10: + print('collected training ref') + break + + hyps = list(hyps) + lines = [] + j = 0 + for i in range(len(srcs)): + lines += ['\t'.join([srcs[i], refs[i], hyp]) for hyp in hyps[j:j+10]] + j = j + 10 + with open('out/rand.tsv', 'w', encoding='utf-8') as f: + f.write('\n'.join(lines)) + + +def create_human_baseline(): + path = 'data/datasets/dailydialog_data/test.txt' + lines = [] + for line in open(path, encoding='utf-8'): + src, ref = line.strip('\n').split('\t') + src = src.strip() + ref = ref.strip() + lines.append('\t'.join([src, ref, ref])) + + with open('out/human.tsv', 'w', encoding='utf-8') as f: + f.write('\n'.join(lines)) + + +if __name__ == "__main__": + #create_rand_baseline() + #create_human_baseline() + eval_dialog_response('out/eval_text_generation_results (1).txt') + #eval('out/rand.tsv') \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/grad_app.py b/Optimus/code/examples/big_ae/grad_app.py new file mode 100644 index 0000000000000000000000000000000000000000..50e7950a0657713ede246014ef4861d0e1fa4128 --- /dev/null +++ b/Optimus/code/examples/big_ae/grad_app.py @@ -0,0 +1,486 @@ +# -*- coding: utf-8 -*- +"""message_bottle.ipynb + +Automatically generated by Colab. + +Original file is located at + https://colab.research.google.com/drive/1I47sLakpuwERGzn-XoNct67mwiDS1mQD +""" + +import matplotlib.pyplot as plt +import matplotlib + +import argparse +import glob +import logging +import os +import pickle +import random + + +import torch +import torch.nn.functional as F +import numpy as np + +from tqdm import tqdm, trange +from types import SimpleNamespace + +import sys +sys.path.append('/home/ryn_mote/Misc/generative_recommender/text_space/Optimus/code/examples/big_ae/') +sys.path.append('/home/ryn_mote/Misc/generative_recommender/text_space/Optimus/code/') +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer +from pytorch_transformers import BertForLatentConnector, BertTokenizer + +from modules import VAE + +import torch +import torch.nn as nn +import torch.nn.functional as F +torch.set_float32_matmul_precision('high') + +from tqdm import tqdm + +################################################ + + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). 
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None): + + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + while True: + # for _ in trange(length): + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0]: + break + + return generated + + +def latent_code_from_text(text,):# args): + tokenized1 = tokenizer_encoder.encode(text) + tokenized1 = [101] + tokenized1 + [102] + coded1 = torch.Tensor([tokenized1]) + coded1 =torch.Tensor.long(coded1) + with torch.no_grad(): + x0 = coded1 + x0 = x0.to('cuda') + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + coded_length = len(tokenized1) + return latent_z, coded_length + +# args +def text_from_latent_code(latent_z): + past = latent_z + context_tokens = tokenizer_decoder.encode('') + + length = 128 # maximum length, but not used + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length= length, # Chunyuan: Fix length; or use to complete a sentence + temperature=.2, + top_k=50, + top_p=.98, + device='cuda', + decoder_tokenizer = tokenizer_decoder + ) + text_x1 = tokenizer_decoder.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + return text_x1 + + +################################################ +# Load model + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'bert': (BertConfig, BertForLatentConnector, 
BertTokenizer) +} + +latent_size = 768 +model_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-full-31250/' +encoder_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-encoder-31250/' +decoder_path = '/home/ryn_mote/Misc/generative_recommender/text_space/1.0_checkpoint-31250/checkpoint-31250/checkpoint-decoder-31250/' +block_size = 100 + +# Load a trained Encoder model and vocabulary that you have fine-tuned +encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES['bert'] +model_encoder = encoder_model_class.from_pretrained(encoder_path, latent_size=latent_size) +tokenizer_encoder = encoder_tokenizer_class.from_pretrained('bert-base-cased', do_lower_case=True) + +model_encoder.to('cuda') +if block_size <= 0: + block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model +block_size = min(block_size, tokenizer_encoder.max_len_single_sentence) + +# Load a trained Decoder model and vocabulary that you have fine-tuned +decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES['gpt2'] +model_decoder = decoder_model_class.from_pretrained(decoder_path, latent_size=latent_size) +tokenizer_decoder = decoder_tokenizer_class.from_pretrained('gpt2', do_lower_case=False) +model_decoder.to('cuda') +if block_size <= 0: + block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model +block_size = min(block_size, tokenizer_decoder.max_len_single_sentence) + +# Load full model +output_full_dir = '/home/ryn_mote/Misc/generative_recommender/text_space/' +checkpoint = torch.load(os.path.join(model_path, 'training.bin')) + +# Chunyuan: Add Padding token to GPT2 +special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} +num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) +print('We have added', num_added_toks, 'tokens to GPT2') +model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
+assert tokenizer_decoder.pad_token == '' + + +# Evaluation +model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, SimpleNamespace(**{'latent_size': latent_size, 'device':'cuda'})) +model_vae.load_state_dict(checkpoint['model_state_dict']) +print("Pre-trained Optimus is successfully loaded") +model_vae.to('cuda').to(torch.bfloat16) + +l = latent_code_from_text('A photo of a mountain.')[0] +t = text_from_latent_code(l) +print(t, l, l.shape) +################################################ + +import gradio as gr +import numpy as np +from sklearn.svm import SVC +from sklearn.inspection import permutation_importance +from sklearn import preprocessing +import pandas as pd +import random +import time + + +dtype = torch.bfloat16 +torch.set_grad_enabled(False) + +prompt_list = [p for p in list(set( + pd.read_csv('./twitter_prompts.csv').iloc[:, 1].tolist())) if type(p) == str] + +start_time = time.time() + +####################### Setup Model + +# TODO put back +# @spaces.GPU() +def generate(prompt, in_embs=None,): + if prompt != '': + print(prompt) + #in_embs = in_embs / in_embs.abs().max() * .15 if in_embs != None else None + in_embs = .9 * in_embs.to('cuda') + .5 * latent_code_from_text(prompt)[0] if in_embs != None else latent_code_from_text(prompt)[0] + else: + print('From embeds.') + in_embs = in_embs / in_embs.abs().max() * .6 + in_embs = in_embs.to('cuda').to(torch.bfloat16) + plt.close('all') + plt.hist(np.array(in_embs.detach().to('cpu').to(torch.float)).flatten(), bins=5) + plt.savefig('real_im_emb_plot.jpg') + + + text = text_from_latent_code(in_embs) + in_embs = latent_code_from_text(text)[0] + print(text) + return text, in_embs.to('cpu') + + +####################### + +# TODO add to state instead of shared across all +glob_idx = 0 + +def next_one(embs, ys, calibrate_prompts): + global glob_idx + glob_idx = glob_idx + 1 + + with torch.no_grad(): + if len(calibrate_prompts) > 0: + print('######### Calibrating with sample prompts #########') + prompt = calibrate_prompts.pop(0) + text, img_embs = generate(prompt) + embs += img_embs + print(len(embs)) + return text, embs, ys, calibrate_prompts + else: + print('######### Roaming #########') + + + # handle case where every instance of calibration prompts is 'Neither' or 'Like' or 'Dislike' + if len(list(set(ys))) <= 1: + embs.append(.01*torch.randn(latent_size)) + embs.append(.01*torch.randn(latent_size)) + ys.append(0) + ys.append(1) + if len(list(ys)) < 10: + embs += [.01*torch.randn(latent_size)] * 3 + ys += [0] * 3 + + pos_indices = [i for i in range(len(embs)) if ys[i] == 1] + neg_indices = [i for i in range(len(embs)) if ys[i] == 0] + + # the embs & ys stay tied by index but we shuffle to drop randomly + random.shuffle(pos_indices) + random.shuffle(neg_indices) + + #if len(pos_indices) - len(neg_indices) > 48 and len(pos_indices) > 80: + # pos_indices = pos_indices[32:] + if len(neg_indices) - len(pos_indices) > 48/16 and len(pos_indices) > 6: + pos_indices = pos_indices[5:] + if len(neg_indices) - len(pos_indices) > 48/16 and len(neg_indices) > 6: + neg_indices = neg_indices[5:] + + + if len(neg_indices) > 25: + neg_indices = neg_indices[1:] + + print(len(pos_indices), len(neg_indices)) + indices = pos_indices + neg_indices + + embs = [embs[i] for i in indices] + ys = [ys[i] for i in indices] + + + indices = list(range(len(embs))) + + # also add the latest 0 and the latest 1 + has_0 = False + has_1 = False + for i in reversed(range(len(ys))): + if ys[i] == 0 and has_0 == False: + indices.append(i) + 
has_0 = True + elif ys[i] == 1 and has_1 == False: + indices.append(i) + has_1 = True + if has_0 and has_1: + break + + # we may have just encountered a rare multi-threading diffusers issue (https://github.com/huggingface/diffusers/issues/5749); + # this ends up adding a rating but losing an embedding, it seems. + # let's take off a rating if so to continue without indexing errors. + if len(ys) > len(embs): + print('ys are longer than embs; popping latest rating') + ys.pop(-1) + + feature_embs = np.array(torch.stack([embs[i].to('cpu') for i in indices]).to('cpu')) + scaler = preprocessing.StandardScaler().fit(feature_embs) + feature_embs = scaler.transform(feature_embs) + chosen_y = np.array([ys[i] for i in indices]) + + print('Gathering coefficients') + lin_class = SVC(max_iter=50000, kernel='linear', class_weight='balanced', C=.1).fit(feature_embs, chosen_y) + coef_ = torch.tensor(lin_class.coef_, dtype=torch.double) + print(coef_.shape, 'COEF') + print('Gathered') + + rng_prompt = random.choice(prompt_list) + w = 1# if len(embs) % 2 == 0 else 0 + im_emb = w * coef_.to(dtype=dtype) + + prompt= '' if glob_idx % 3 != 0 else rng_prompt + text, im_emb = generate(prompt, im_emb) + embs += im_emb + + + return text, embs, ys, calibrate_prompts + + + + + + + + + +def start(_, embs, ys, calibrate_prompts): + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return [ + gr.Button(value='Like (L)', interactive=True), + gr.Button(value='Neither (Space)', interactive=True), + gr.Button(value='Dislike (A)', interactive=True), + gr.Button(value='Start', interactive=False), + text, + embs, + ys, + calibrate_prompts + ] + + +def choose(text, choice, embs, ys, calibrate_prompts): + if choice == 'Like (L)': + choice = 1 + elif choice == 'Neither (Space)': + embs = embs[:-1] + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return text, embs, ys, calibrate_prompts + else: + choice = 0 + + # if we detected NSFW, leave that area of latent space regardless of how they rated chosen. + # TODO skip allowing rating + if text == None: + print('NSFW -- choice is disliked') + choice = 0 + + ys += [choice]*1 + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return text, embs, ys, calibrate_prompts + +css = '''.gradio-container{max-width: 700px !important} +#description{text-align: center} +#description h1, #description h3{display: block} +#description p{margin-top: 0} +.fade-in-out {animation: fadeInOut 3s forwards} +@keyframes fadeInOut { + 0% { + background: var(--bg-color); + } + 100% { + background: var(--button-secondary-background-fill); + } +} +''' +js_head = ''' + +''' + +with gr.Blocks(css=css, head=js_head) as demo: + gr.Markdown('''# Compass +### Generative Recommenders for Exporation of Text + +Explore the latent space without prompting based on your preferences. Learn more on [the write-up](https://rynmurdock.github.io/posts/2024/3/generative_recomenders/). + ''', elem_id="description") + embs = gr.State([]) + ys = gr.State([]) + calibrate_prompts = gr.State([ + 'the moon is melting into my glass of tea', + 'a sea slug -- pair of claws scuttling -- jelly fish glowing', + 'an adorable creature. 
It may be a goblin or a pig or a slug.', + 'an animation about a gorgeous nebula', + 'a sketch of an impressive mountain by da vinci', + 'a watercolor painting: the octopus writhes', + ]) + def l(): + return None + + with gr.Row(elem_id='output-image'): + text = gr.Textbox(interactive=False, elem_id="text") + with gr.Row(equal_height=True): + b3 = gr.Button(value='Dislike (A)', interactive=False, elem_id="dislike") + b2 = gr.Button(value='Neither (Space)', interactive=False, elem_id="neither") + b1 = gr.Button(value='Like (L)', interactive=False, elem_id="like") + b1.click( + choose, + [text, b1, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + b2.click( + choose, + [text, b2, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + b3.click( + choose, + [text, b3, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + with gr.Row(): + b4 = gr.Button(value='Start') + b4.click(start, + [b4, embs, ys, calibrate_prompts], + [b1, b2, b3, b4, text, embs, ys, calibrate_prompts]) + with gr.Row(): + html = gr.HTML('''
You will rate several calibration prompts and then roam the latent space.


+
Note that while the model is unlikely to produce NSFW text, it may still do so; please avoid NSFW content when rating. + +

+
Thanks to @multimodalart for their contributions to the demo, esp. the interface and @maxbittker for feedback. +''') + +demo.launch(share=True) diff --git a/Optimus/code/examples/big_ae/metrics.py b/Optimus/code/examples/big_ae/metrics.py new file mode 100755 index 0000000000000000000000000000000000000000..e75b4a2c84d159cf5ae14ee94837e0c841856ace --- /dev/null +++ b/Optimus/code/examples/big_ae/metrics.py @@ -0,0 +1,196 @@ +import os +from multiprocessing import Pool +import pdb +import numpy as np +import nltk +nltk.download('punkt') + +from nltk.translate.bleu_score import SmoothingFunction + +try: + from multiprocessing import cpu_count +except: + from os import cpu_count + +class Metrics(object): + def __init__(self): + self.name = 'Metric' + + def get_name(self): + return self.name + + def set_name(self, name): + self.name = name + + def get_score(self): + pass + + +class Bleu(Metrics): + def __init__(self, test_text='', real_text='', gram=3, num_real_sentences=500, num_fake_sentences=10000): + super(Bleu, self).__init__() + self.name = 'Bleu' + self.test_data = test_text + self.real_data = real_text + self.gram = gram + self.sample_size = num_real_sentences + self.reference = None + self.is_first = True + self.num_sentences = num_fake_sentences + + + def get_name(self): + return self.name + + def get_score(self, is_fast=True, ignore=False): + if ignore: + return 0 + if self.is_first: + self.get_reference() + self.is_first = False + if is_fast: + return self.get_bleu_fast() + return self.get_bleu_parallel() + + # fetch REAL DATA + def get_reference(self): + if self.reference is None: + reference = list() + with open(self.real_data) as real_data: + for text in real_data: + text = nltk.word_tokenize(text) + reference.append(text) + self.reference = reference + return reference + else: + return self.reference + + def get_bleu(self): + raise Exception('make sure you call BLEU paralell') + ngram = self.gram + bleu = list() + reference = self.get_reference() + weight = tuple((1. / ngram for _ in range(ngram))) + with open(self.test_data) as test_data: + for hypothesis in test_data: + hypothesis = nltk.word_tokenize(hypothesis) + bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight, + smoothing_function=SmoothingFunction().method1)) + return sum(bleu) / len(bleu) + + def calc_bleu(self, reference, hypothesis, weight): + return nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight, + smoothing_function=SmoothingFunction().method1) + + def get_bleu_fast(self): + reference = self.get_reference() + reference = reference[0:self.sample_size] + return self.get_bleu_parallel(reference=reference) + + def get_bleu_parallel(self, reference=None): + ngram = self.gram + if reference is None: + reference = self.get_reference() + weight = tuple((1. 
/ ngram for _ in range(ngram))) + pool = Pool(cpu_count()) + result = list() + maxx = self.num_sentences + with open(self.test_data) as test_data: + for i, hypothesis in enumerate(test_data): + #print('i : {}'.format(i)) + hypothesis = nltk.word_tokenize(hypothesis) + result.append(pool.apply_async(self.calc_bleu, args=(reference, hypothesis, weight))) + if i > maxx : break + score = 0.0 + cnt = 0 + for it, i in enumerate(result): + #print('i : {}'.format(it)) + score += i.get() + cnt += 1 + pool.close() + pool.join() + return score / cnt + + + + +class SelfBleu(Metrics): + def __init__(self, test_text='', gram=3, model_path='', num_sentences=500): + super(SelfBleu, self).__init__() + self.name = 'Self-Bleu' + self.test_data = test_text + self.gram = gram + self.sample_size = num_sentences + self.reference = None + self.is_first = True + + + def get_name(self): + return self.name + + def get_score(self, is_fast=True, ignore=False): + if ignore: + return 0 + if self.is_first: + self.get_reference() + self.is_first = False + if is_fast: + return self.get_bleu_fast() + return self.get_bleu_parallel() + + def get_reference(self): + if self.reference is None: + reference = list() + with open(self.test_data) as real_data: + for text in real_data: + text = nltk.word_tokenize(text) + reference.append(text) + self.reference = reference + return reference + else: + return self.reference + + def get_bleu(self): + ngram = self.gram + bleu = list() + reference = self.get_reference() + weight = tuple((1. / ngram for _ in range(ngram))) + with open(self.test_data) as test_data: + for hypothesis in test_data: + hypothesis = nltk.word_tokenize(hypothesis) + bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight, + smoothing_function=SmoothingFunction().method1)) + return sum(bleu) / len(bleu) + + def calc_bleu(self, reference, hypothesis, weight): + return nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight, + smoothing_function=SmoothingFunction().method1) + + def get_bleu_fast(self): + reference = self.get_reference() + # random.shuffle(reference) + reference = reference[0:self.sample_size] + return self.get_bleu_parallel(reference=reference) + + def get_bleu_parallel(self, reference=None): + ngram = self.gram + if reference is None: + reference = self.get_reference() + weight = tuple((1. 
/ ngram for _ in range(ngram))) + pool = Pool(cpu_count()) + result = list() + sentence_num = len(reference) + for index in range(sentence_num): + #genious: + hypothesis = reference[index] + other = reference[:index] + reference[index+1:] + result.append(pool.apply_async(self.calc_bleu, args=(other, hypothesis, weight))) + + score = 0.0 + cnt = 0 + for i in result: + score += i.get() + cnt += 1 + pool.close() + pool.join() + return score / cnt diff --git a/Optimus/code/examples/big_ae/modules/__init__.py b/Optimus/code/examples/big_ae/modules/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..46f9c1042373aa646f5a4ee3eb3ea422f51f1212 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/__init__.py @@ -0,0 +1,7 @@ +from .encoders import * +from .decoders import * +from .vae import * +from .utils import * +from .spacefusion import * +from .cara import * +from .arae import * diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..60f1fe055a15aa5c4e07690cdbb884d2c27c5223 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..6ec9b3056bc39177e6518b01b985b266ce1e5ef3 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/__init__.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..089ef323f0948bef68168e54f809db6bb57cb75d Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..3389d724cd4df7547c226384572798c34b7183b7 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/arae.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..2a7d90049876604ed3607c7a9542eb80fbc52d53 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..3bbaf35740cd9a8ec1f085535956cf300172bc4d Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/cara.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..968d0f039fd6a64aeb5a6682d8b1ae5520ff4e11 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-310.pyc differ 
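The `Bleu` and `SelfBleu` classes defined in `metrics.py` above score plain-text files with one sentence per line: `Bleu` compares generated sentences against a reference file, while `SelfBleu` scores each generated sentence against the others as a diversity measure (lower is more diverse). A minimal usage sketch, assuming hypothetical file names and that it is run from `code/examples/big_ae`:

```python
from metrics import Bleu, SelfBleu

# Hypothetical paths: each file contains one tokenizable sentence per line.
generated_file = 'generated_sentences.txt'
reference_file = 'test_references.txt'

# BLEU of generated text against real references (higher is better).
bleu = Bleu(test_text=generated_file, real_text=reference_file,
            gram=4, num_real_sentences=500, num_fake_sentences=1000)
print('BLEU-4:', bleu.get_score(is_fast=True))

# Self-BLEU of the generated text against itself (lower means more diverse samples).
self_bleu = SelfBleu(test_text=generated_file, gram=4, num_sentences=500)
print('Self-BLEU-4:', self_bleu.get_score(is_fast=True))
```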
diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..003634076fb0330e56a01f033f3a4a2cab7f29f2 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/spacefusion.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..cfd50224ef2d4a608fc11733c1545facedcb73b3 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..bf99c8a2321a501fa96f8408027dd156b6facd60 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/utils.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-310.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..b77ce47957a4ab685e5f8f8c7b1b5efc60db16c8 Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-310.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-37.pyc b/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..aeaa12679289478e82000326a3392738ba04c6cd Binary files /dev/null and b/Optimus/code/examples/big_ae/modules/__pycache__/vae.cpython-37.pyc differ diff --git a/Optimus/code/examples/big_ae/modules/arae.py b/Optimus/code/examples/big_ae/modules/arae.py new file mode 100755 index 0000000000000000000000000000000000000000..cc4ee4e5f44c47e56903912f184d8be3345cf5a0 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/arae.py @@ -0,0 +1,274 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class ARAE(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(ARAE, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + 
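        # The latent classifier above predicts the conditional label from z, while the latent
        # discriminator defined next distinguishes encoder-produced z from generator-produced z;
        # together they yield the adversarial losses (loss_lsc, loss_lsd, loss_lsg) in forward().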
self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data + + self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask=None): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), conditional labels. + + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + # Latent discriminator + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + # Latent classifier + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + + # Generate based on encoded z and gt labels. 
(reconstruction) + past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + + past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + # Train a classifier in the observation space + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Loss + loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + loss_cls + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. (conditional generation) + cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. + ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. 
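            # acc_cg_cls below measures how often text generated from a sampled z and the flipped
            # label is classified as actually carrying that flipped label.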
+ cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits diff --git a/Optimus/code/examples/big_ae/modules/cara.py b/Optimus/code/examples/big_ae/modules/cara.py new file mode 100755 index 0000000000000000000000000000000000000000..ef480533d32bf80310ce51b127b14a67def2a91c --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/cara.py @@ -0,0 +1,374 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class CARA(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(CARA, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data + + self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), 
conditional labels. + + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + #################### Latent discriminator for sampling from a simple distribution #################### + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + #################### Latent classifier for disentanglement #################### + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + + #################### Recontruction loss with latent z and label emb #################### + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + past_sampled_label = sampled_label_emb + + # Generate based on encoded z and gt labels. 
(reconstruction) + # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + past_z = latent_z + # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + gen_past_z = gen_z # (B, n_blocks * hidden_size) + + # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + + past = latent_z + label_emb # (B, n_blocks * hidden_size) + + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + #################### Train a classifier in the observation space #################### + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size) + + # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer. + # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight) + # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len) + # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h) + # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1) + # if self.args.label_size <= 2: + # prob_at_soft_cls = prob_at_soft_cls.squeeze(1) + # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float()) + # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long) + # else: + # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels) + # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1) + # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float() + + # Loss + loss_latent_space = (loss_encoder + loss_lsc) + (loss_lsd + loss_lsg) + self.args.beta_cls * loss_cls # + loss_at_soft_cls + loss = loss_rec + 0.0 * loss_latent_space + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. 
(conditional generation) + # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. + ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. + cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + # 'loss_at_soft_cls': loss_at_soft_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + # 'acc_at_soft_cls': acc_at_soft_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + + # softmax sample + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = 
self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits + + def sample_sequence_conditional_batch_soft(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1) + context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size) + context_soft.scatter_(1, context, 1) # (B, vocab_size) + generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size) + + # with torch.no_grad(): + while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size) + inputs = {'soft_ids': generated_soft, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] # (B, seq_len, vocab_size) + + # Gumbel softmax sample + next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size) + generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size) + + # # softmax sample + # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + # filtered_logits = 
F.softmax(filtered_logits, dim=-1) + # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1) + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated_soft # (B, seq_len, vocab_size) + + +### Gumbel Softmax +def gumbel_softmax(logits, temperature, hard=False): + """Sample from the Gumbel-Softmax distribution and optionally discretize. + Args: + logits: [..., n_class] unnormalized log-probs + temperature: non-negative scalar + hard: if True, take argmax, but differentiate w.r.t. soft sample y + Returns: + [..., n_class] sample from the Gumbel-Softmax distribution. + If hard=True, then the returned sample will be one-hot, otherwise it will be a probabilitiy distribution that sums to 1 across classes + """ + y = gumbel_softmax_sample(logits, temperature) # (..., n_class) + + if hard: # return onehot + shape = y.size() + _, ind = y.max(dim=-1) + y_hard = torch.zeros_like(y).view(-1, shape[-1]) + y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot + y_hard = y_hard.view(*shape) + # Set gradients w.r.t. y_hard gradients w.r.t. y + y = (y_hard - y).detach() + y + + return y # (..., n_class) + +from torch.nn import functional as F +def gumbel_softmax_sample(logits, temperature): + y = logits + sample_gumbel(logits.size(), logits.device) + return F.softmax(y / temperature, dim=-1) + +def sample_gumbel(shape, device, eps=1e-20): + U = torch.rand(shape).to(device=device) + return -torch.log(-torch.log(U + eps) + eps) diff --git a/Optimus/code/examples/big_ae/modules/ctrl_gen.py b/Optimus/code/examples/big_ae/modules/ctrl_gen.py new file mode 100755 index 0000000000000000000000000000000000000000..2b828132a0d208f9aacec9d70151b8e1562cfcc1 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/ctrl_gen.py @@ -0,0 +1,371 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class Ctrl_Gen(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(Ctrl_Gen, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = 
decoder.transformer.wte.weight.data + + self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), conditional labels. + + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + # Latent discriminator + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + # Latent classifier + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + past_sampled_label = sampled_label_emb + + # Generate based on encoded z and gt labels. 
(reconstruction) + # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + past_z = latent_z + # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + gen_past_z = gen_z # (B, n_blocks * hidden_size) + + # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + + past = latent_z + label_emb # (B, n_blocks * hidden_size) + + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + # Train a classifier in the observation space + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size) + + # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer. + # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight) + # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len) + # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h) + # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1) + # if self.args.label_size <= 2: + # prob_at_soft_cls = prob_at_soft_cls.squeeze(1) + # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float()) + # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long) + # else: + # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels) + # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1) + # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float() + + # Loss + loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + self.args.beta_cls * loss_cls # + loss_at_soft_cls + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. (conditional generation) + # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. 
+ ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. + cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + # 'loss_at_soft_cls': loss_at_soft_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + # 'acc_at_soft_cls': acc_at_soft_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + + # softmax sample + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + 
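            # Early stop: once every sequence in the batch has emitted the end-of-sequence token,
            # stop sampling.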
if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits + + def sample_sequence_conditional_batch_soft(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1) + context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size) + context_soft.scatter_(1, context, 1) # (B, vocab_size) + generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size) + + # with torch.no_grad(): + while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size) + inputs = {'soft_ids': generated_soft, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] # (B, seq_len, vocab_size) + + # Gumbel softmax sample + next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size) + generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size) + + # # softmax sample + # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + # filtered_logits = F.softmax(filtered_logits, dim=-1) + # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1) + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated_soft # (B, 
seq_len, vocab_size) + + +### Gumbel Softmax +def gumbel_softmax(logits, temperature, hard=False): + """Sample from the Gumbel-Softmax distribution and optionally discretize. + Args: + logits: [..., n_class] unnormalized log-probs + temperature: non-negative scalar + hard: if True, take argmax, but differentiate w.r.t. soft sample y + Returns: + [..., n_class] sample from the Gumbel-Softmax distribution. + If hard=True, then the returned sample will be one-hot, otherwise it will be a probabilitiy distribution that sums to 1 across classes + """ + y = gumbel_softmax_sample(logits, temperature) # (..., n_class) + + if hard: # return onehot + shape = y.size() + _, ind = y.max(dim=-1) + y_hard = torch.zeros_like(y).view(-1, shape[-1]) + y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot + y_hard = y_hard.view(*shape) + # Set gradients w.r.t. y_hard gradients w.r.t. y + y = (y_hard - y).detach() + y + + return y # (..., n_class) + +from torch.nn import functional as F +def gumbel_softmax_sample(logits, temperature): + y = logits + sample_gumbel(logits.size(), logits.device) + return F.softmax(y / temperature, dim=-1) + +def sample_gumbel(shape, device, eps=1e-20): + U = torch.rand(shape).to(device=device) + return -torch.log(-torch.log(U + eps) + eps) diff --git a/Optimus/code/examples/big_ae/modules/decoders/dec_gpt2.py b/Optimus/code/examples/big_ae/modules/decoders/dec_gpt2.py new file mode 100755 index 0000000000000000000000000000000000000000..9e1a725291a1883d8946f935467f73d3239fd4f0 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/decoders/dec_gpt2.py @@ -0,0 +1,358 @@ +# import torch + +import time +import argparse + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence + +import numpy as np + +from .decoder import DecoderBase + +class LSTMDecoder(DecoderBase): + """LSTM decoder with constant-length data""" + def __init__(self, args, vocab, model_init, emb_init): + super(LSTMDecoder, self).__init__() + self.ni = args.ni + self.nh = args.dec_nh + self.nz = args.nz + self.vocab = vocab + self.device = args.device + + # no padding when setting padding_idx to -1 + self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=-1) + + self.dropout_in = nn.Dropout(args.dec_dropout_in) + self.dropout_out = nn.Dropout(args.dec_dropout_out) + + # for initializing hidden state and cell + self.trans_linear = nn.Linear(args.nz, args.dec_nh, bias=False) + + # concatenate z with input + self.lstm = nn.LSTM(input_size=args.ni + args.nz, + hidden_size=args.dec_nh, + num_layers=1, + batch_first=True) + + # prediction layer + self.pred_linear = nn.Linear(args.dec_nh, len(vocab), bias=False) + + vocab_mask = torch.ones(len(vocab)) + # vocab_mask[vocab['']] = 0 + self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduce=False) + + self.reset_parameters(model_init, emb_init) + + def reset_parameters(self, model_init, emb_init): + # for name, param in self.lstm.named_parameters(): + # # self.initializer(param) + # if 'bias' in name: + # nn.init.constant_(param, 0.0) + # # model_init(param) + # elif 'weight' in name: + # model_init(param) + + # model_init(self.trans_linear.weight) + # model_init(self.pred_linear.weight) + for param in self.parameters(): + model_init(param) + emb_init(self.embed.weight) + + def sample_text(self, input, z, EOS, device): + sentence = [input] + max_index = 0 + + input_word = input + batch_size, n_sample, _ = z.size() + seq_len = 1 + z_ = z.expand(batch_size, seq_len, self.nz) + seq_len = 
input.size(1) + softmax = torch.nn.Softmax(dim=0) + while max_index != EOS and len(sentence) < 100: + # (batch_size, seq_len, ni) + word_embed = self.embed(input_word) + word_embed = torch.cat((word_embed, z_), -1) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + if len(sentence) == 1: + h_init = h_init.squeeze(dim=1) + c_init = c_init.squeeze(dim=1) + output, hidden = self.lstm.forward(word_embed, (h_init, c_init)) + else: + output, hidden = self.lstm.forward(word_embed, hidden) + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + output_logits = output_logits.view(-1) + probs = softmax(output_logits) + # max_index = torch.argmax(output_logits) + max_index = torch.multinomial(probs, num_samples=1) + input_word = torch.tensor([[max_index]]).to(device) + sentence.append(max_index) + return sentence + + def decode(self, input, z): + """ + Args: + input: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + """ + + # not predicting start symbol + # sents_len -= 1 + + batch_size, n_sample, _ = z.size() + seq_len = input.size(1) + + # (batch_size, seq_len, ni) + word_embed = self.embed(input) + word_embed = self.dropout_in(word_embed) + + if n_sample == 1: + z_ = z.expand(batch_size, seq_len, self.nz) + + else: + word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \ + .contiguous() + + # (batch_size * n_sample, seq_len, ni) + word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni) + + z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous() + z_ = z_.view(batch_size * n_sample, seq_len, self.nz) + + # (batch_size * n_sample, seq_len, ni + nz) + word_embed = torch.cat((word_embed, z_), -1) + + z = z.view(batch_size * n_sample, self.nz) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + # h_init = self.trans_linear(z).unsqueeze(0) + # c_init = h_init.new_zeros(h_init.size()) + output, _ = self.lstm(word_embed, (h_init, c_init)) + + output = self.dropout_out(output) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + + return output_logits + + def reconstruct_error(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + #remove end symbol + src = x[:, :-1] + + # remove start symbol + tgt = x[:, 1:] + + batch_size, seq_len = src.size() + n_sample = z.size(1) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.decode(src, z) + + if n_sample == 1: + tgt = tgt.contiguous().view(-1) + else: + # (batch_size * n_sample * seq_len) + tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \ + .contiguous().view(-1) + + # (batch_size * n_sample * seq_len) + loss = self.loss(output_logits.view(-1, output_logits.size(2)), + tgt) + + + # (batch_size, n_sample) + return loss.view(batch_size, n_sample, -1).sum(-1) + + + def log_probability(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). 
+ log_p(x|z) across different x and z + """ + + return -self.reconstruct_error(x, z) + + + + + def greedy_decode(self, z): + return self.sample_decode(z, greedy=True) + + def sample_decode(self, z, greedy=False): + """sample/greedy decoding from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + batch_size = z.size(0) + decoded_batch = [[] for _ in range(batch_size)] + + # (batch_size, 1, nz) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + + decoder_hidden = (h_init, c_init) + decoder_input = torch.tensor([self.vocab[""]] * batch_size, dtype=torch.long, device=self.device).unsqueeze(1) + end_symbol = torch.tensor([self.vocab[""]] * batch_size, dtype=torch.long, device=self.device) + + mask = torch.ones((batch_size), dtype=torch.uint8, device=self.device) + length_c = 1 + while mask.sum().item() != 0 and length_c < 100: + + # (batch_size, 1, ni) --> (batch_size, 1, ni+nz) + word_embed = self.embed(decoder_input) + word_embed = torch.cat((word_embed, z.unsqueeze(1)), dim=-1) + + output, decoder_hidden = self.lstm(word_embed, decoder_hidden) + + # (batch_size, 1, vocab_size) --> (batch_size, vocab_size) + decoder_output = self.pred_linear(output) + output_logits = decoder_output.squeeze(1) + + # (batch_size) + if greedy: + max_index = torch.argmax(output_logits, dim=1) + else: + probs = F.softmax(output_logits, dim=1) + max_index = torch.multinomial(probs, num_samples=1).squeeze(1) + + decoder_input = max_index.unsqueeze(1) + length_c += 1 + + for i in range(batch_size): + word = self.vocab.id2word(max_index[i].item()) + if mask[i].item(): + decoded_batch[i].append(self.vocab.id2word(max_index[i].item())) + + mask = torch.mul((max_index != end_symbol), mask) + + return decoded_batch + +class VarLSTMDecoder(LSTMDecoder): + """LSTM decoder with constant-length data""" + def __init__(self, args, vocab, model_init, emb_init): + super(VarLSTMDecoder, self).__init__(args, vocab, model_init, emb_init) + + self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=vocab['']) + vocab_mask = torch.ones(len(vocab)) + vocab_mask[vocab['']] = 0 + self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduce=False) + + self.reset_parameters(model_init, emb_init) + + def decode(self, input, z): + """ + Args: + input: tuple which contains x and sents_len + x: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + z: (batch_size, n_sample, nz) + """ + + input, sents_len = input + + # not predicting start symbol + sents_len = sents_len - 1 + + batch_size, n_sample, _ = z.size() + seq_len = input.size(1) + + # (batch_size, seq_len, ni) + word_embed = self.embed(input) + word_embed = self.dropout_in(word_embed) + + if n_sample == 1: + z_ = z.expand(batch_size, seq_len, self.nz) + + else: + word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \ + .contiguous() + + # (batch_size * n_sample, seq_len, ni) + word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni) + + z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous() + z_ = z_.view(batch_size * n_sample, seq_len, self.nz) + + # (batch_size * n_sample, seq_len, ni + nz) + word_embed = torch.cat((word_embed, z_), -1) + + sents_len = sents_len.unsqueeze(1).expand(batch_size, n_sample).contiguous().view(-1) + packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True) + + z = z.view(batch_size * n_sample, self.nz) + # h_init = self.trans_linear(z).unsqueeze(0) + # c_init = 
h_init.new_zeros(h_init.size()) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + output, _ = self.lstm(packed_embed, (h_init, c_init)) + output, _ = pad_packed_sequence(output, batch_first=True) + + output = self.dropout_out(output) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + + return output_logits + + def reconstruct_error(self, x, z): + """Cross Entropy in the language case + Args: + x: tuple which contains x_ and sents_len + x_: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + x, sents_len = x + + #remove end symbol + src = x[:, :-1] + + # remove start symbol + tgt = x[:, 1:] + + batch_size, seq_len = src.size() + n_sample = z.size(1) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.decode((src, sents_len), z) + + if n_sample == 1: + tgt = tgt.contiguous().view(-1) + else: + # (batch_size * n_sample * seq_len) + tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \ + .contiguous().view(-1) + + # (batch_size * n_sample * seq_len) + loss = self.loss(output_logits.view(-1, output_logits.size(2)), + tgt) + + + # (batch_size, n_sample) + return loss.view(batch_size, n_sample, -1).sum(-1) \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/decoders/decoder.py b/Optimus/code/examples/big_ae/modules/decoders/decoder.py new file mode 100755 index 0000000000000000000000000000000000000000..da75beb16da7e929f04c5178336096ecc6e7facf --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/decoders/decoder.py @@ -0,0 +1,79 @@ +import torch +import torch.nn as nn + + +class DecoderBase(nn.Module): + """docstring for Decoder""" + def __init__(self): + super(DecoderBase, self).__init__() + + + def freeze(self): + for param in self.parameters(): + param.requires_grad = False + + def decode(self, x, z): + """ + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: Tensor1 + Tensor1: the output logits with size (batch_size * n_sample, seq_len, vocab_size) + """ + + raise NotImplementedError + + def reconstruct_error(self, x, z): + """reconstruction loss + Args: + x: (batch_size, *) + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + raise NotImplementedError + + def beam_search_decode(self, z, K): + """beam search decoding + Args: + z: (batch_size, nz) + K: the beam size + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def sample_decode(self, z): + """sampling from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def greedy_decode(self, z): + """greedy decoding from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def log_probability(self, x, z): + """ + Args: + x: (batch_size, *) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). 
+ log_p(x|z) across different x and z + """ + + raise NotImplementedError \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/encoders/__init__.py b/Optimus/code/examples/big_ae/modules/encoders/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..8b63707c81d4f1872b1d02baac891c8ac40b32f8 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/encoders/__init__.py @@ -0,0 +1 @@ +from .enc_lstm import * \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/encoders/enc_lstm.py b/Optimus/code/examples/big_ae/modules/encoders/enc_lstm.py new file mode 100755 index 0000000000000000000000000000000000000000..3fe5a1a342bf5d823dc9f43141cad2a7a80f6ee7 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/encoders/enc_lstm.py @@ -0,0 +1,126 @@ +from itertools import chain +import math +import torch +import torch.nn as nn + +from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence +from .gaussian_encoder import GaussianEncoderBase +from ..utils import log_sum_exp + +class GaussianLSTMEncoder(GaussianEncoderBase): + """Gaussian LSTM Encoder with constant-length input""" + def __init__(self, args, vocab_size, model_init, emb_init): + super(GaussianLSTMEncoder, self).__init__() + self.ni = args.ni + self.nh = args.enc_nh + self.nz = args.nz + self.args = args + + self.embed = nn.Embedding(vocab_size, args.ni) + + self.lstm = nn.LSTM(input_size=args.ni, + hidden_size=args.enc_nh, + num_layers=1, + batch_first=True, + dropout=0) + + self.linear = nn.Linear(args.enc_nh, 2 * args.nz, bias=False) + + self.reset_parameters(model_init, emb_init) + + def reset_parameters(self, model_init, emb_init): + # for name, param in self.lstm.named_parameters(): + # # self.initializer(param) + # if 'bias' in name: + # nn.init.constant_(param, 0.0) + # # model_init(param) + # elif 'weight' in name: + # model_init(param) + + # model_init(self.linear.weight) + # emb_init(self.embed.weight) + for param in self.parameters(): + model_init(param) + emb_init(self.embed.weight) + + + def forward(self, input): + """ + Args: + x: (batch_size, seq_len) + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + # (batch_size, seq_len-1, args.ni) + word_embed = self.embed(input) + + _, (last_state, last_cell) = self.lstm(word_embed) + + mean, logvar = self.linear(last_state).chunk(2, -1) + + # fix variance as a pre-defined value + if self.args.fix_var > 0: + logvar = mean.new_tensor([[[math.log(self.args.fix_var)]]]).expand_as(mean) + + return mean.squeeze(0), logvar.squeeze(0) + + # def eval_inference_mode(self, x): + # """compute the mode points in the inference distribution + # (in Gaussian case) + # Returns: Tensor + # Tensor: the posterior mode points with shape (*, nz) + # """ + + # # (batch_size, nz) + # mu, logvar = self.forward(x) + + +class VarLSTMEncoder(GaussianLSTMEncoder): + """Gaussian LSTM Encoder with variable-length input""" + def __init__(self, args, vocab_size, model_init, emb_init): + super(VarLSTMEncoder, self).__init__(args, vocab_size, model_init, emb_init) + + + def forward(self, input): + """ + Args: + input: tuple which contains x and sents_len + x: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + input, sents_len = input + # (batch_size, seq_len, args.ni) + word_embed = self.embed(input) + + 
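    # Note (editorial sketch, not part of the original patch): the variable-length
    # encoder packs the padded batch before running the LSTM, so padded positions
    # are skipped. `pack_padded_sequence(..., batch_first=True)` requires the
    # lengths to be sorted in descending order in older PyTorch versions (newer
    # versions accept enforce_sorted=False); the caller is assumed to provide
    # batches already sorted by length.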
packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True) + + _, (last_state, last_cell) = self.lstm(packed_embed) + + mean, logvar = self.linear(last_state).chunk(2, -1) + + return mean.squeeze(0), logvar.squeeze(0) + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Args: + input: tuple which contains x and sents_len + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL diff --git a/Optimus/code/examples/big_ae/modules/encoders/encoder.py b/Optimus/code/examples/big_ae/modules/encoders/encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..6daed22c92648923eb90a1f49d91a07f75d63262 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/encoders/encoder.py @@ -0,0 +1,58 @@ +import math +import torch +import torch.nn as nn + +from ..utils import log_sum_exp + +class EncoderBase(nn.Module): + """docstring for EncoderBase""" + def __init__(self): + super(EncoderBase, self).__init__() + + def forward(self, x): + """ + Args: + x: (batch_size, *) + Returns: the tensors required to parameterize a distribution. + E.g. for Gaussian encoder it returns the mean and variance tensors + """ + + raise NotImplementedError + + def sample(self, input, nsamples): + """sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + raise NotImplementedError + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + raise NotImplementedError + + + def eval_inference_dist(self, x, z, param=None): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + raise NotImplementedError + + def calc_mi(self, x): + """Approximate the mutual information between x and z + I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z)) + Returns: Float + """ + + raise NotImplementedError \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/encoders/gaussian_encoder.py b/Optimus/code/examples/big_ae/modules/encoders/gaussian_encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..1b97e7eec85a7d4fcf064da1c90bbc07e8b97073 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/encoders/gaussian_encoder.py @@ -0,0 +1,147 @@ +import math +import torch +import torch.nn as nn + +from .encoder import EncoderBase +from ..utils import log_sum_exp + +class GaussianEncoderBase(EncoderBase): + """docstring for EncoderBase""" + def __init__(self): + super(GaussianEncoderBase, self).__init__() + + def freeze(self): + for param in self.parameters(): + param.requires_grad = False + + def forward(self, x): + """ + Args: + x: (batch_size, *) + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + raise NotImplementedError + + def encode_stats(self, x): + + return self.forward(x) + + def sample(self, input, nsamples): + 
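        # Editorial sketch (hypothetical usage, not from the original file):
        #     z, (mu, logvar) = encoder.sample(x, nsamples=5)   # z: (batch, 5, nz)
        # The (mu, logvar) pair is returned together with z so that log q(z|x)
        # can be re-evaluated later (e.g. in eval_inference_dist) without a
        # second forward pass through the encoder.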
"""sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + return z, (mu, logvar) + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + def reparameterize(self, mu, logvar, nsamples=1): + """sample from posterior Gaussian family + Args: + mu: Tensor + Mean of gaussian distribution with shape (batch, nz) + logvar: Tensor + logvar of gaussian distibution with shape (batch, nz) + Returns: Tensor + Sampled z with shape (batch, nsamples, nz) + """ + batch_size, nz = mu.size() + std = logvar.mul(0.5).exp() + + mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz) + std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz) + + eps = torch.zeros_like(std_expd).normal_() + + return mu_expd + torch.mul(eps, std_expd) + + def eval_inference_dist(self, x, z, param=None): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + nz = z.size(2) + + if not param: + mu, logvar = self.forward(x) + else: + mu, logvar = param + + # (batch_size, 1, nz) + mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1) + var = logvar.exp() + + # (batch_size, nsamples, nz) + dev = z - mu + + # (batch_size, nsamples) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + return log_density + + + + def calc_mi(self, x): + """Approximate the mutual information between x and z + I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z)) + Returns: Float + """ + + # [x_batch, nz] + mu, logvar = self.forward(x) + + x_batch, nz = mu.size() + + # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1) + neg_entropy = (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).mean() + + # [z_batch, 1, nz] + z_samples = self.reparameterize(mu, logvar, 1) + + # [1, x_batch, nz] + mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0) + var = logvar.exp() + + # (z_batch, x_batch, nz) + dev = z_samples - mu + + # (z_batch, x_batch) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + # log q(z): aggregate posterior + # [z_batch] + log_qz = log_sum_exp(log_density, dim=1) - math.log(x_batch) + + return (neg_entropy - log_qz.mean(-1)).item() \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/spacefusion.py b/Optimus/code/examples/big_ae/modules/spacefusion.py new file mode 100755 index 0000000000000000000000000000000000000000..bacfd96016853c56ddf1774c37b238b6be4737a3 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/spacefusion.py @@ -0,0 +1,143 @@ +from .vae import VAE +import numpy as np +import torch, copy, pdb +import torch.nn.functional as F + +from torch import nn + +import pdb + + +def set_trainable(module, value): + for param in module.parameters(): + param.requires_grad = value + +class 
SpaceFusion(VAE): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): + super(SpaceFusion, self).__init__(encoder, decoder, tokenizer_encoder, tokenizer_decoder, args) + children = [v for v in encoder.encoder.layer.children()] # list of 12 BertLayer + + self.num_s2s_bert_layer = args.num_s2s_bert_layer + self.S2S_layers = nn.ModuleList([copy.deepcopy(c) for c in children[-args.num_s2s_bert_layer:] ]) # the last layer of encoder + self.S2S_pooler = copy.deepcopy(encoder.pooler) + self.ix_turn_sep = tokenizer_encoder.convert_tokens_to_ids('[SEP]') + if args.freeze_bert: + print('@'*20 + f' freezing BERT {args.num_frozen_bert_layer} layers') + for child in children[:args.num_frozen_bert_layer]: + set_trainable(child, False) + + + + def ids2speaker(self, ids): + # 0 for speaker A, 1 for speaker B + N, T = ids.shape + speaker = np.zeros((N, T)) + sep = ids == self.ix_turn_sep + for i in range(N): + is_B = False # start with speaker A + for t in range(T): + speaker[i,t] = int(is_B) + if sep[i,t].item(): + is_B = not is_B + + # make sure the final speaker is speaker B (so response is always speaker A) + if not is_B: + speaker = 1 - speaker + + return torch.LongTensor(speaker).to(ids.device) + + def forward(self, inputs_src, inputs_tgt, labels_tgt, return_vec=False): # [batch, time] + # toggle config to get desired encoder output + self.encoder.encoder.output_attentions = False + self.encoder.encoder.output_hidden_states = True + + + # AE encoder + mask = (inputs_tgt > 0).float().to(inputs_src.device) + outputs = self.encoder(inputs_tgt, attention_mask=mask) + z_AE, _ = self.connect(outputs[1]) + z_AE = z_AE.squeeze(1) + + # S2S encoder + mask = (inputs_src > 0).float() + speaker = self.ids2speaker(inputs_src) + outputs = self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker) + _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs + seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 () + + for s2s in self.S2S_layers: + layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + seq_z_prev = layer_outputs[0] + + z_S2S = self.encoder.pooler(layer_outputs[0]) + z_S2S, _ = self.connect(z_S2S) + z_S2S = z_S2S.squeeze(1) + + if return_vec: + return z_AE, z_S2S + + # interpolation/smoothness + u = torch.FloatTensor(np.random.random((z_AE.shape[0], 1))).to(inputs_tgt.device) + z_interp = u * z_AE + (1 - u) * z_S2S + std = 0.1 + noise = torch.FloatTensor(np.random.normal(size=z_interp.shape) * std).to(z_interp.device) + z_interp = z_interp + noise + + loss_rec = 0 + z_idx = 0 + for z in [z_AE, z_S2S, z_interp]: + #pdb.set_trace() + past = z # past = self.decoder.linear(z) + outputs = self.decoder(input_ids=labels_tgt, past=past, labels=labels_tgt, label_ignore=self.pad_token_id) + if z_idx == 1: + loss_rec = loss_rec + 1.0 * outputs[0] + else: + loss_rec = loss_rec + outputs[0] + z_idx += 1 + loss_rec = loss_rec/3 + + # fusion/regularization + L_pull = self.dist_pair(z_AE, z_S2S) + L_push = torch.stack([self.dist_batch(z) for z in [z_AE, z_S2S]]).min() + loss_reg = (L_pull - L_push * 2) / np.sqrt(z.shape[-1]) + + loss = loss_rec + self.args.beta * loss_reg + return loss_rec, loss_reg, loss + + def sent2latent(self, inputs_src): + # toggle config to get desired encoder output + self.encoder.encoder.output_attentions = False + self.encoder.encoder.output_hidden_states = True + + # S2S encoder + mask = (inputs_src > 0).float() + speaker = self.ids2speaker(inputs_src) + outputs = 
self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker) + + _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs + # seq_z_prev = all_layer_attn[-2] # seq of z at layer 11 () + # layer_outputs = self.S2S_layer(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + + seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 () + for s2s in self.S2S_layers: + layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + seq_z_prev = layer_outputs[0] + + z_S2S = self.encoder.pooler(layer_outputs[0]) + z_S2S, _ = self.connect(z_S2S) + z_S2S = z_S2S.squeeze(1) + + return z_S2S + + + def dist_pair(self, a, b): + return F.pairwise_distance(a, b).mean() + + + def dist_batch(self, vec): + n = vec.shape[0] + dmin = [] + for i in range(n): + dd = F.pairwise_distance(vec[i:i+1,:].repeat(n,1), vec) + dmin.append(dd.min()) + return torch.stack(dmin).mean() \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/utils.py b/Optimus/code/examples/big_ae/modules/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..57afd02c2d43e895143569a0f29e431043510409 --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/utils.py @@ -0,0 +1,40 @@ +import torch + +def safe_log(z): + return torch.log(z + 1e-7) + +def log_sum_exp(value, dim=None, keepdim=False): + """Numerically stable implementation of the operation + value.exp().sum(dim, keepdim).log() + """ + if dim is not None: + m, _ = torch.max(value, dim=dim, keepdim=True) + value0 = value - m + if keepdim is False: + m = m.squeeze(dim) + return m + torch.log(torch.sum(torch.exp(value0), dim=dim, keepdim=keepdim)) + else: + m = torch.max(value) + sum_exp = torch.sum(torch.exp(value - m)) + return m + torch.log(sum_exp) + + +def generate_grid(zmin, zmax, dz, device, ndim=2): + """generate a 1- or 2-dimensional grid + Returns: Tensor, int + Tensor: The grid tensor with shape (k^2, 2), + where k=(zmax - zmin)/dz + int: k + """ + + if ndim == 2: + x = torch.arange(zmin, zmax, dz) + k = x.size(0) + + x1 = x.unsqueeze(1).repeat(1, k).view(-1) + x2 = x.repeat(k) + + return torch.cat((x1.unsqueeze(-1), x2.unsqueeze(-1)), dim=-1).to(device), k + + elif ndim == 1: + return torch.arange(zmin, zmax, dz).unsqueeze(1).to(device) \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/modules/vae.py b/Optimus/code/examples/big_ae/modules/vae.py new file mode 100755 index 0000000000000000000000000000000000000000..e3e697383556b455ba9a247a51113d281c0cb8cd --- /dev/null +++ b/Optimus/code/examples/big_ae/modules/vae.py @@ -0,0 +1,638 @@ +import math +import torch +import torch.nn as nn + +from .utils import log_sum_exp + +import pdb + +import logging +logger = logging.getLogger(__name__) + + +class VAE(nn.Module): + """VAE with normal prior""" + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(VAE, self).__init__() + self.encoder = encoder + self.decoder = decoder + + self.args = args + self.nz = args.latent_size + + self.eos_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.eos_token])[0] + self.pad_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.pad_token])[0] + + + # connector: from Bert hidden units to the latent space + # self.linear = nn.Linear(args.nz, 2 * args.nz, bias=False) + + # Standard Normal prior + loc = torch.zeros(self.nz, device=args.device) + scale = torch.ones(self.nz, device=args.device) + self.prior = 
torch.distributions.normal.Normal(loc, scale) + + def connect(self, bert_fea, nsamples=1): + """ + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + # pdb.set_trace() + # mean, logvar = mean.squeeze(0), logvar.squeeze(0) + + # (batch, nsamples, nz) + z = self.reparameterize(mean, logvar, nsamples) + KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + def connect_deterministic(self, bert_fea, nsamples=1): + """ + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + # pdb.set_trace() + # mean, logvar = mean.squeeze(0), logvar.squeeze(0) + + logvar.fill_(.0) + # (batch, nsamples, nz) + z = self.reparameterize(mean, logvar, nsamples) + KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + + + def reparameterize(self, mu, logvar, nsamples=1): + """sample from posterior Gaussian family + Args: + mu: Tensor + Mean of gaussian distribution with shape (batch, nz) + logvar: Tensor + logvar of gaussian distibution with shape (batch, nz) + Returns: Tensor + Sampled z with shape (batch, nsamples, nz) + """ + batch_size, nz = mu.size() + std = logvar.mul(0.5).exp() + + mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz) + std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz) + + eps = torch.zeros_like(std_expd).normal_() + + return mu_expd + torch.mul(eps, std_expd) + + def forward(self, inputs, labels): + + # pdb.set_trace() + + attention_mask=(inputs > 0).float() + # logger.info(inputs) + # logger.info(attention_mask) + # logger.info(labels) + reconstrution_mask=(labels != 50257).float() # 50257 is the padding token for GPT2 + sent_length = torch.sum(reconstrution_mask, dim=1) + + + outputs = self.encoder(inputs, attention_mask) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + if self.args.fb_mode==0: + # Connect hidden feature to the latent space + latent_z, loss_kl = self.connect(pooled_hidden_fea) + latent_z = latent_z.squeeze(1) + + + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + elif self.args.fb_mode==1: + # Connect hidden feature to the latent space + mu, logvar = self.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = self.reparameterize(mu, logvar, nsamples=1) + latent_z = latent_z.squeeze(1) + loss_kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1) + kl_mask = (loss_kl > self.args.dim_target_kl).float() + loss_kl = (kl_mask * loss_kl).sum(dim=1) + + # pdb.set_trace() + # past = self.decoder.linear(latent_z) + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + elif self.args.fb_mode==2: + # Connect hidden feature to the latent space + latent_z, loss_kl = self.connect_deterministic(pooled_hidden_fea) + latent_z = latent_z.squeeze(1) + + # past = self.decoder.linear(latent_z) + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, 
label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + + # pdb.set_trace() + if self.args.length_weighted_loss: + loss = loss_rec / sent_length + self.args.beta * loss_kl + else: + loss = loss_rec + self.args.beta * loss_kl + + + return loss_rec, loss_kl, loss + + + + def encoder_sample(self, bert_fea, nsamples): + """sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + # (batch_size, nz) + + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + mu, logvar = mu.squeeze(0), logvar.squeeze(0) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + return z, (mu, logvar) + + + def encode_stats(self, x): + """ + Returns: Tensor1, Tensor2 + Tensor1: the mean of latent z with shape [batch, nz] + Tensor2: the logvar of latent z with shape [batch, nz] + """ + + return self.encoder.encode_stats(x) + + def decode(self, z, strategy, K=10): + """generate samples from z given strategy + Args: + z: [batch, nsamples, nz] + strategy: "beam" or "greedy" or "sample" + K: the beam width parameter + Returns: List1 + List1: a list of decoded word sequence + """ + + if strategy == "beam": + return self.decoder.beam_search_decode(z, K) + elif strategy == "greedy": + return self.decoder.greedy_decode(z) + elif strategy == "sample": + return self.decoder.sample_decode(z) + else: + raise ValueError("the decoding strategy is not supported") + + + def reconstruct(self, x, decoding_strategy="greedy", K=5): + """reconstruct from input x + Args: + x: (batch, *) + decoding_strategy: "beam" or "greedy" or "sample" + K: the beam width parameter + Returns: List1 + List1: a list of decoded word sequence + """ + z = self.sample_from_inference(x).squeeze(1) + + return self.decode(z, decoding_strategy, K) + + def log_probability(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). + log_p(x|z) across different x and z + """ + outputs = self.decoder(input_ids=x, past=z, labels=x, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + return -loss_rec + + + + def loss_iw(self, x0, x1, nsamples=50, ns=1): + """ + Args: + x: if the data is constant-length, x is the data tensor with + shape (batch, *). 
Otherwise x is a tuple that contains + the data tensor and length list + Returns: Tensor1, Tensor2, Tensor3 + Tensor1: total loss [batch] + Tensor2: reconstruction loss shape [batch] + Tensor3: KL loss shape [batch] + """ + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + + ################## + # compute KL + ################## + # pdb.set_trace() + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + # mu, logvar = mu.squeeze(0), logvar.squeeze(0) + ll_tmp, rc_tmp = [], [] + for _ in range(int(nsamples / ns)): + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, ns) + # past = self.decoder.linear(z) + past = z + + # [batch, nsamples] + log_prior = self.eval_prior_dist(z) + log_gen = self.eval_cond_ll(x1, past) + log_infer = self.eval_inference_dist(z, (mu, logvar)) + + # pdb.set_trace() + log_gen = log_gen.unsqueeze(0).contiguous().view(z.shape[0],-1) + + + # pdb.set_trace() + rc_tmp.append(log_gen) + ll_tmp.append(log_gen + log_prior - log_infer) + + + + log_prob_iw = log_sum_exp(torch.cat(ll_tmp, dim=-1), dim=-1) - math.log(nsamples) + log_gen_iw = torch.mean(torch.cat(rc_tmp, dim=-1), dim=-1) + + return log_prob_iw, log_gen_iw , KL + + + def nll_iw(self, x0, x1, nsamples, ns=1): + """compute the importance weighting estimate of the log-likelihood + Args: + x0, x1: two different tokenization results of x, where x is the data tensor with shape (batch, *). + nsamples: Int + the number of samples required to estimate marginal data likelihood + Returns: Tensor1 + Tensor1: the estimate of log p(x), shape [batch] + """ + + # compute iw every ns samples to address the memory issue + # nsamples = 500, ns = 100 + # nsamples = 500, ns = 10 + + # TODO: note that x is forwarded twice in self.encoder.sample(x, ns) and self.eval_inference_dist(x, z, param) + #. 
this problem is to be solved in order to speed up + + tmp = [] + for _ in range(int(nsamples / ns)): + # [batch, ns, nz] + + # Chunyuan: + # encoding into bert features + pooled_hidden_fea = self.encoder(x0)[1] + + # param is the parameters required to evaluate q(z|x) + z, param = self.encoder_sample(pooled_hidden_fea, ns) + + # [batch, ns] + log_comp_ll = self.eval_complete_ll(x1, z) + log_infer_ll = self.eval_inference_dist(z, param) + + tmp.append(log_comp_ll - log_infer_ll) + + ll_iw = log_sum_exp(torch.cat(tmp, dim=-1), dim=-1) - math.log(nsamples) + + return ll_iw + + def KL(self, x): + _, KL = self.encode(x, 1) + + return KL + + def eval_prior_dist(self, zrange): + """perform grid search to calculate the true posterior + Args: + zrange: tensor + different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/space + """ + + # (k^2) + return self.prior.log_prob(zrange).sum(dim=-1) + + def eval_complete_ll(self, x, z): + """compute log p(z,x) + Args: + x: Tensor + input with shape [batch, seq_len] + z: Tensor + evaluation points with shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log p(z,x) Tensor with shape [batch, nsamples] + """ + + # [batch, nsamples] + log_prior = self.eval_prior_dist(z) + log_gen = self.eval_cond_ll(x, z) + + return log_prior + log_gen + + + + def eval_cond_ll(self, x, z): + """compute log p(x|z) + """ + x_shape = list(x.size()) + z_shape = list(z.size()) + if len(z_shape) == 3: + x = x.unsqueeze(1).repeat(1, z_shape[1], 1).contiguous().view(x_shape[0]*z_shape[1], x_shape[-1]) + z = z.contiguous().view(x_shape[0]*z_shape[1], z_shape[-1]) + + return self.log_probability(x, z) + + + + def eval_log_model_posterior(self, x, grid_z): + """perform grid search to calculate the true posterior + this function computes p(z|x) + Args: + grid_z: tensor + different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/pace + Returns: Tensor + Tensor: the log posterior distribution log p(z|x) with + shape [batch_size, K^2] + """ + try: + batch_size = x.size(0) + except: + batch_size = x[0].size(0) + + # (batch_size, k^2, nz) + grid_z = grid_z.unsqueeze(0).expand(batch_size, *grid_z.size()).contiguous() + + # (batch_size, k^2) + log_comp = self.eval_complete_ll(x, grid_z) + + # normalize to posterior + log_posterior = log_comp - log_sum_exp(log_comp, dim=1, keepdim=True) + + return log_posterior + + def sample_from_inference(self, x, nsamples=1): + """perform sampling from inference net + Returns: Tensor + Tensor: samples from infernece nets with + shape (batch_size, nsamples, nz) + """ + z, _ = self.encoder.sample(x, nsamples) + + return z + + + def sample_from_posterior(self, x, nsamples): + """perform MH sampling from model posterior + Returns: Tensor + Tensor: samples from model posterior with + shape (batch_size, nsamples, nz) + """ + + # use the samples from inference net as initial points + # for MCMC sampling. 
[batch_size, nsamples, nz] + cur = self.encoder.sample_from_inference(x, 1) + cur_ll = self.eval_complete_ll(x, cur) + total_iter = self.args.mh_burn_in + nsamples * self.args.mh_thin + samples = [] + for iter_ in range(total_iter): + next = torch.normal(mean=cur, + std=cur.new_full(size=cur.size(), fill_value=self.args.mh_std)) + # [batch_size, 1] + next_ll = self.eval_complete_ll(x, next) + ratio = next_ll - cur_ll + + accept_prob = torch.min(ratio.exp(), ratio.new_ones(ratio.size())) + + uniform_t = accept_prob.new_empty(accept_prob.size()).uniform_() + + # [batch_size, 1] + mask = (uniform_t < accept_prob).float() + mask_ = mask.unsqueeze(2) + + cur = mask_ * next + (1 - mask_) * cur + cur_ll = mask * next_ll + (1 - mask) * cur_ll + + if iter_ >= self.args.mh_burn_in and (iter_ - self.args.mh_burn_in) % self.args.mh_thin == 0: + samples.append(cur.unsqueeze(1)) + + return torch.cat(samples, dim=1) + + + def calc_model_posterior_mean(self, x, grid_z): + """compute the mean value of model posterior, i.e. E_{z ~ p(z|x)}[z] + Args: + grid_z: different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/pace + x: [batch, *] + Returns: Tensor1 + Tensor1: the mean value tensor with shape [batch, nz] + """ + + # [batch, K^2] + log_posterior = self.eval_log_model_posterior(x, grid_z) + posterior = log_posterior.exp() + + # [batch, nz] + return torch.mul(posterior.unsqueeze(2), grid_z.unsqueeze(0)).sum(1) + + def calc_infer_mean(self, x): + """ + Returns: Tensor1 + Tensor1: the mean of inference distribution, with shape [batch, nz] + """ + + mean, logvar = self.encoder.forward(x) + + return mean + + + + + def eval_inference_dist(self, z, param): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + nz = z.size(2) + mu, logvar = param + + # (batch_size, 1, nz) + mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1) + var = logvar.exp() + + # (batch_size, nsamples, nz) + dev = z - mu + + # (batch_size, nsamples) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + return log_density + + + + def calc_mi(self, test_data_batch, args): + # calc_mi_v3 + import math + from modules.utils import log_sum_exp + + mi = 0 + num_examples = 0 + + mu_batch_list, logvar_batch_list = [], [] + neg_entropy = 0. + for batch_data in test_data_batch: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + (batch_size, nz) + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + x_batch, nz = mu.size() + + #print(x_batch, end=' ') + + num_examples += x_batch + + # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1) + + neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item() + mu_batch_list += [mu.cpu()] + logvar_batch_list += [logvar.cpu()] + + pdb.set_trace() + + neg_entropy = neg_entropy / num_examples + ##print() + + num_examples = 0 + log_qz = 0. 
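        # Editorial note (sketch of the estimator, not part of the original patch):
        # the loop below completes the Monte Carlo estimate of
        #     I(x, z) = E_x E_{q(z|x)}[log q(z|x)] - E_{q(z)}[log q(z)],
        # where the first term is the negative entropy accumulated in `neg_entropy`
        # above, and log q(z) for each sampled z is approximated by the
        # log-mean-exp of the Gaussian densities q(z | x_j) over the collected
        # posteriors, i.e. log_sum_exp(log_density, dim=1) - log(x_batch).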
+ for i in range(len(mu_batch_list)): + ############### + # get z_samples + ############### + mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + + # [z_batch, 1, nz] + + z_samples = self.reparameterize(mu, logvar, 1) + + z_samples = z_samples.view(-1, 1, nz) + num_examples += z_samples.size(0) + + ############### + # compute density + ############### + # [1, x_batch, nz] + #mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + #indices = list(np.random.choice(np.arange(len(mu_batch_list)), 10)) + [i] + indices = np.arange(len(mu_batch_list)) + mu = torch.cat([mu_batch_list[_] for _ in indices], dim=0).cuda() + logvar = torch.cat([logvar_batch_list[_] for _ in indices], dim=0).cuda() + x_batch, nz = mu.size() + + mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0) + var = logvar.exp() + + # (z_batch, x_batch, nz) + dev = z_samples - mu + + # (z_batch, x_batch) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + # log q(z): aggregate posterior + # [z_batch] + log_qz += (log_sum_exp(log_density, dim=1) - math.log(x_batch)).sum(-1) + + log_qz /= num_examples + mi = neg_entropy - log_qz + + return mi + + + + def calc_au(self, eval_dataloader, args, delta=0.01): + """compute the number of active units + """ + cnt = 0 + for batch_data in eval_dataloader: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + means_sum = mean.sum(dim=0, keepdim=True) + else: + means_sum = means_sum + mean.sum(dim=0, keepdim=True) + cnt += mean.size(0) + + # (1, nz) + mean_mean = means_sum / cnt + + cnt = 0 + for batch_data in eval_dataloader: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + mean, _ = self.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + var_sum = ((mean - mean_mean) ** 2).sum(dim=0) + else: + var_sum = var_sum + ((mean - mean_mean) ** 2).sum(dim=0) + cnt += mean.size(0) + + # (nz) + au_var = var_sum / (cnt - 1) + + return (au_var >= delta).sum().item(), au_var + diff --git a/Optimus/code/examples/big_ae/run_data_filtering.py b/Optimus/code/examples/big_ae/run_data_filtering.py new file mode 100755 index 0000000000000000000000000000000000000000..675f140a6e88112a19608a2a12de6e57ee9cf786 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_data_filtering.py @@ -0,0 +1,507 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
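In this repository the script is used to filter and re-shard the pre-training data
rather than to fine-tune a model: train() below only drops examples whose tokenized
length exceeds 256 and re-writes the surviving instances as JSON shards.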
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import json +import random +from pathlib import Path + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, MultipleFiles_DataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear) + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.input_file_path + dataloader = MultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True, use_tensor=False) + else: + pass + return dataloader + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + + files = Path(args.input_file_path) + num_files = len(list(files.glob('*seq64*.json'))) + + # create output file folder + if not os.path.exists(args.output_file_path) and args.local_rank in [-1, 0]: + os.makedirs(args.output_file_path) + + + # Train! 
+ logger.info("***** Running training *****") + logger.info(" Num files = %d", num_files) + logger.info(" Num examples of first file = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + + num_collected, num_dropped = 0, 0 + + model_vae.zero_grad() + num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + + tmp_list = [] + dict_token_length = defaultdict(int) + + + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + dict_file = os.path.join(args.output_dir, args.dataset.lower()+f'.length_freq.json' ) + + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in num_train_epochs_iterator: + + for idx_file in range(num_files): + + examples = [] + cached_features_file = os.path.join(args.output_file_path, args.dataset.lower()+f'.segmented.nltk.split.seq64.{train_dataloader.file_idx}.json' ) + logger.info(f"Epoch {epoch}, File idx {train_dataloader.file_idx}") + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + + # if idx_file > 11: + # break + + for step, batch in enumerate(epoch_iterator): + + inst, token_lengths = batch + dict_token_length[ token_lengths[0,0].item() ] += 1 + + if ( token_lengths> 256 ).sum().item()>0: + over_length_tensor = ( token_lengths> 256 ).sum(-1) + inst_ = [inst[i] for i in range(len(inst)) if over_length_tensor[i]==0 ] + examples += inst_ + num_collected += len(inst_) + num_dropped += len(inst) - len(inst_) + logger.info(f"{num_dropped} files filtered.") + else: + examples += inst + num_collected += len(inst) + + # Good practice: save your data multiple times on Philly + + if args.use_philly: + save_solid = False + while not save_solid: + try: + with open(cached_features_file, 'w') as fp: + json.dump(examples, fp) + save_solid = True + except: + pass + else: + with open(cached_features_file, 'w') as fp: + json.dump(examples, fp) + logger.info(f"Saving features in the cached file at {cached_features_file}") + + train_dataloader.reset() + + if args.local_rank in [-1, 0]: + tb_writer.close() + + logger.info(dict_token_length) + # Good practice: save your dict multiple times on Philly + if args.use_philly: + save_solid = False + while not save_solid: + try: + with open(dict_file, 'w') as fp: + json.dump(dict_token_length, fp) + save_solid = True + except: + pass + else: + with open(dict_file, 'w') as fp: + json.dump(dict_token_length, fp) + + return num_collected, num_dropped + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--input_file_path", default=None, type=str, required=True, + help="The output directory where the input files will be written.") + parser.add_argument("--output_file_path", default=None, type=str, required=True, + help="The output directory where the output files will be written.") + parser.add_argument("--output_dir", default=None, type=str, 
required=True, + help="The output directory where the logs and results will be saved.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + + + ## Other parameters + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + + if os.path.exists(args.output_file_path) and os.listdir(args.output_file_path) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_file_path)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + logger.info(f'Local rank is {args.local_rank}') + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, 
tokenizer_decoder.max_len_single_sentence) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + # model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + # on_gpu = next(model_vae.parameters()).is_cuda + + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + num_collected, num_dropped = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" num_collected = %s, num_dropped = %s", num_collected, num_dropped) + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_dialog_dataloader.py b/Optimus/code/examples/big_ae/run_dialog_dataloader.py new file mode 100755 index 0000000000000000000000000000000000000000..a95b8c355897b537fcad9cebfd011d95071b7ad9 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_dialog_dataloader.py @@ -0,0 +1,483 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, Dialog_BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers, frange_cycle_linear, frange_cycle_zero_linear) + + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = Dialog_BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for 
nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + + model_vae.zero_grad() + + # model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + tmp_list = [] + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + input_ids_bert_ctx, input_ids_bert, input_ids_gpt, token_lengths = batch + + logger.info(f'Conxtext in Bert, Length {token_lengths[0]} ; Tokens: {input_ids_bert_ctx}') + logger.info(f'Response in Bert, Length {token_lengths[1]} ; Tokens: {input_ids_bert}') + logger.info(f'Response in GPT2, Length {token_lengths[2]} ; Tokens: {input_ids_gpt}') + # TODO: write donw training scripts for dialog response generation + + + if (step + 1) % args.gradient_accumulation_steps == 0: + + global_step += 1 + + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step + + + + + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", 
default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_pretrained_model", action='store_true', + help="Use pre-trained auto-encoder models as the initialization") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + if args.use_pretrained_model: + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else 
 args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
+        model_decoder.to(args.device)
+        if args.block_size <= 0:
+            args.block_size = tokenizer_decoder.max_len_single_sentence  # Our input block size will be the max possible for the model
+        args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
+
+    else:
+        ## Encoder
+        encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type]
+        encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path)
+        tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case)
+        if args.block_size <= 0:
+            args.block_size = tokenizer_encoder.max_len_single_sentence  # Our input block size will be the max possible for the model
+        args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence)
+        model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size)
+        # model_encoder.to(args.device)
+
+        ## Decoder
+        decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type]
+        decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
+        tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
+        if args.block_size <= 0:
+            args.block_size = tokenizer_decoder.max_len_single_sentence  # Our input block size will be the max possible for the model
+        args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
+        model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size)
+
+    # Chunyuan: Add Padding token to GPT2
+    special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''}
+    num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
+    print('We have added', num_added_toks, 'tokens to GPT2')
+    model_decoder.resize_token_embeddings(len(tokenizer_decoder))  # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
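
The special-token literals in the dictionary above (and in the assert that follows) appear to have been stripped during extraction, presumably because they resemble HTML tags. The sketch below shows the intended pattern with conventional `<PAD>`/`<BOS>`/`<EOS>` strings; these exact literals are an assumption, not taken from this diff:

```python
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Assumed literals: the blank strings in the diff are most likely stripped '<PAD>'/'<BOS>'/'<EOS>' markers.
special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)  # returns how many tokens were newly added
model.resize_token_embeddings(len(tokenizer))                       # embedding matrix must match the enlarged vocabulary
assert tokenizer.pad_token == '<PAD>'
print('Added', num_added_toks, 'special tokens; vocab size is now', len(tokenizer))
```
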
+ assert tokenizer_decoder.pad_token == '' + + # model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + # on_gpu = next(model_vae.parameters()).is_cuda + + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s", global_step) + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_encoding_generation.py b/Optimus/code/examples/big_ae/run_encoding_generation.py new file mode 100755 index 0000000000000000000000000000000000000000..f5ef1226208cba52389dca6b305bf3d93b930ac7 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_encoding_generation.py @@ -0,0 +1,487 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
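
Both `main()` functions above gate model and vocabulary loading behind a rank-0-first barrier so that only one process downloads while the others wait and then read from the local cache. A minimal sketch of that pattern (the helper name is ours; it assumes `torch.distributed.init_process_group` has already been called, as in the scripts):

```python
import torch

def load_with_rank_zero_first(local_rank, load_fn):
    """Run load_fn on rank 0 first; other ranks wait, then run it (typically hitting the cache)."""
    if local_rank not in [-1, 0]:
        torch.distributed.barrier()   # non-zero ranks wait here while rank 0 downloads/preprocesses
    result = load_fn()
    if local_rank == 0:
        torch.distributed.barrier()   # release the waiting ranks once rank 0 is done
    return result
```
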
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import glob +import logging +import os +import pickle +import random + + +import torch +import torch.nn.functional as F +import numpy as np + +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + + +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer +from pytorch_transformers import BertForLatentConnector, BertTokenizer + +from collections import defaultdict +from modules import VAE +from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader) + + +import pdb + + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ()) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer) +} + +# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia +# in https://github.com/rusiaaman/XLNet-gen#methodology +# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e +PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family +(except for Alexei and Maria) are discovered. +The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the +remainder of the story. 1883 Western Siberia, +a young Grigori Rasputin is asked by his father and a group of men to perform magic. +Rasputin has a vision and denounces one of the men as a horse thief. Although his +father initially slaps him for making such an accusation, Rasputin watches as the +man is chased outside and beaten. Twenty years later, Rasputin sees a vision of +the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, +with people, even a bishop, begging for his blessing. 
""" + + +def set_seed(args): + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False) + else: + pass + return dataloader + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + +def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'): + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + for _ in trange(length): + + inputs = {'input_ids': generated} + if is_xlnet: + # XLNet is a direct (predict same token, not next token) and bi-directional model by default + # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring) + input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1) + perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = torch.zeros((1, 1, 
input_ids.shape[1]), dtype=torch.float, device=device) + target_mapping[0, 0, -1] = 1.0 # predict last token + inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping} + + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + return generated + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None): + + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + while True: + # for _ in trange(length): + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0]: + break + + return generated + + + +# a wrapper function to choose between different play modes +def evaluate_latent_space(args, model_vae, encoder_tokenizer, decoder_tokenizer, prefix=""): + + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + + # Eval! 
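
The decoding loops above draw each next token from logits pruned by `top_k_top_p_filtering`. A small usage sketch follows; the vocabulary size and random logits are stand-ins, not values produced by the script:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(50257)                  # stand-in for one step of decoder logits (GPT-2-sized vocab assumed)
# top_k_top_p_filtering is defined earlier in this file; clone() because it edits the tensor in place
filtered = top_k_top_p_filtering(logits.clone(), top_k=0, top_p=0.9)
probs = F.softmax(filtered, dim=-1)          # pruned positions received -inf logits, so their probability is 0
next_token = torch.multinomial(probs, num_samples=1)
print(int(next_token))                       # id of the sampled token
```
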
+ logger.info("***** Running recontruction evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.per_gpu_eval_batch_size) + + model_vae.eval() + + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + if args.play_mode == 'reconstrction': + result = calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100) + result_file_name = "eval_recontruction_results.txt" + elif args.play_mode == 'interpolation': + result = calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100) + result_file_name = "eval_interpolation_results.txt" + else: + logger.info("Please specify the corrent play mode [reconstrction, interpolation]") + + + eval_output_dir = args.output_dir + output_eval_file = os.path.join(eval_output_dir, result_file_name) + + with open(output_eval_file, "w") as writer: + logger.info("***** Eval {} results *****".format(args.play_mode)) + for key in sorted(result.keys()): + logger.info(" %s \n %s", key, str(result[key])) + writer.write("%s \n %s\n" % (key, str(result[key]))) + + return result + + +def calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1): + + count = 0 + result = defaultdict(str) + for batch in tqdm(eval_dataloader, desc="Evaluating recontruction"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x1 = x1[:,:max_len_values[1]] + + x0 = x0.to(args.device) + x1 = x1.to(args.device) + x_lengths = x_lengths.to(args.device) + + context_tokens = decoder_tokenizer.encode('') + + with torch.no_grad(): + + text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0] + # result["INPUT TEXT " + str(count)].append(text_x0) + + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + + # Connect hidden feature to the latent space + # latent_z, loss_kl = model_vae.connect(pooled_hidden_fea) + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + + past = latent_z + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length=x_lengths[0,1], # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + '\n' + result[text_x0] = text_x1 + + count += 1 + if count>args.total_sents: + break + + + return result + + + + +def calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1): + + count = 0 + latent_codes = [] + sample_interval = 0 + for batch in tqdm(eval_dataloader, desc="Evaluating interpolation"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x0 = x0.to(args.device) + x_lengths = x_lengths.to(args.device) + + + with torch.no_grad(): + if sample_interval == 0 or sample_interval == args.total_sents: + text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0] + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + + # Connect hidden feature to 
the latent space + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + + latent_codes.append(latent_z) + + if sample_interval == 5: + latent_codes.append(latent_z) + sample_interval = 0 + continue + else: + sample_interval += 1 + continue + + count += 1 + if count>args.total_sents: + break + + context_tokens = decoder_tokenizer.encode('') + result = defaultdict(str) + latent_codes_interpolation = [] + num_steps = args.num_interpolation_steps + for step in range(num_steps+1): + latent_z = latent_codes[0] + (latent_codes[1] - latent_codes[0]) * step * 1.0/num_steps + + past = latent_z + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length=x_lengths[0,1], # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + result[step] = text_x1 + + return result + + + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--eval_data_file", default=None, type=str, + help="An input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.") + parser.add_argument("--num_interpolation_steps", default=10, type=int, help="Total sentences to test recontruction.") + parser.add_argument("--play_mode", default="interpolation", type=str, + help="interpolation or reconstruction.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional 
pretrained tokenizer name or path if not the same as model_name_or_path") + + + parser.add_argument("--per_gpu_train_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + + + ## Variational auto-encoder + parser.add_argument("--nz", default=32, type=int, + help="Latent space dimension.") + + parser.add_argument("--prompt", type=str, default="") + parser.add_argument("--padding_text", type=str, default="") + parser.add_argument("--length", type=int, default=20) + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + args = parser.parse_args() + + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + + set_seed(args) + + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary that you have fine-tuned + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary that you have fine-tuned + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, 
latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # Load full model + output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step)) + checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin')) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + + # Evaluation + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args) + model_vae.load_state_dict(checkpoint['model_state_dict']) + logger.info("Pre-trained Optimus is successfully loaded") + model_vae.to(args.device) + + result = evaluate_latent_space(args, model_vae, tokenizer_encoder, tokenizer_decoder, prefix=global_step) + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/big_ae/run_generation_from_prior.py b/Optimus/code/examples/big_ae/run_generation_from_prior.py new file mode 100755 index 0000000000000000000000000000000000000000..8866fc0c4564d080f0c1c32946f933dc83fb4d64 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_generation_from_prior.py @@ -0,0 +1,414 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
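
`calc_interpolate` above decodes sentences along a straight line between two latent codes. A minimal sketch of that interpolation schedule, using random vectors in place of encoder outputs (`latent_size=32` matches the script's default `--latent_size`, but the vectors themselves are placeholders):

```python
import torch

torch.manual_seed(0)
latent_size = 32                                # the script's default --latent_size
z0 = torch.randn(1, latent_size)                # placeholder for the first sentence's latent code
z1 = torch.randn(1, latent_size)                # placeholder for the second sentence's latent code

num_steps = 10                                  # mirrors --num_interpolation_steps
for step in range(num_steps + 1):
    z = z0 + (z1 - z0) * step / num_steps       # same convex combination used in calc_interpolate
    # in the script, each z is passed to sample_sequence_conditional(..., past=z) to decode a sentence
    print(step, round(float(z.norm()), 3))
```
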
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import glob +import logging +import os +import pickle +import random + + +cwd = os.getcwd() +print(f"Current working dir is {cwd}") + +import sys +sys.path.append('./') +pt_path = os.path.join( cwd, 'pytorch_transformers') +sys.path.append(pt_path) +print(f"Pytorch Transformer {pt_path}") + +import torch +import torch.nn.functional as F +import numpy as np + +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + + +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer +from pytorch_transformers import BertForLatentConnector, BertTokenizer + +import pytorch_transformers + +from collections import defaultdict +from modules import VAE +from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader) +from metrics import Bleu, SelfBleu + + + +import pdb + + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ()) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer) +} + +# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia +# in https://github.com/rusiaaman/XLNet-gen#methodology +# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e +PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family +(except for Alexei and Maria) are discovered. +The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the +remainder of the story. 1883 Western Siberia, +a young Grigori Rasputin is asked by his father and a group of men to perform magic. +Rasputin has a vision and denounces one of the men as a horse thief. Although his +father initially slaps him for making such an accusation, Rasputin watches as the +man is chased outside and beaten. Twenty years later, Rasputin sees a vision of +the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, +with people, even a bishop, begging for his blessing. 
""" + + +def set_seed(args): + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False) + else: + pass + return dataloader + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + # top-k + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + # top-p + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + +def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'): + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + for _ in trange(length): + + inputs = {'input_ids': generated} + if is_xlnet: + # XLNet is a direct (predict same token, not next token) and bi-directional model by default + # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring) + input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1) + perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = 
torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device) + target_mapping[0, 0, -1] = 1.0 # predict last token + inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping} + + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + return generated + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None, max_seq_length=-1): + + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + gen_seq_length = 0 + with torch.no_grad(): + while True: + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + gen_seq_length += 1 + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0]: + break + if max_seq_length>0 and gen_seq_length>max_seq_length: + break + + return generated + + +def evaluate_generation_fromp_prior(model_vae, decoder_tokenizer, args, ns=1): + + loc = torch.zeros([args.nz]).to(args.device) + scale = torch.ones([args.nz]).to(args.device) + prior = torch.distributions.normal.Normal(loc, scale) + + context_tokens = decoder_tokenizer.encode('') + + count = 0 + result = defaultdict(str) + for i in tqdm(range(args.num_sents)): + + with torch.no_grad(): + latent_z = prior.sample() + # pdb.set_trace() + past = model_vae.decoder.linear(latent_z.unsqueeze(0)) + + # pdb.set_trace() + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length=args.max_seq_length, # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer, + max_seq_length = args.max_seq_length + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + '\n' + result[i] = text_x1 + + if args.use_philly: + print("PROGRESS: {}%".format( round(100 * i /args.num_sents , 4))) + + with open(args.output_generation_file, "w") as writer: + logger.info("***** SHOW generated sentences from prior *****") + for key in sorted(result.keys()): + # logger.info(" %s \n %s", key, str(result[key])) + # writer.write("%s \n %s\n" % (key, str(result[key]))) + writer.write("%s" % str(result[key])) + + return result + + +# bleu = evaluate_bleu(results, args) + + + + + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--eval_data_file", default=None, type=str, + help="An input evaluation data file to 
evaluate the perplexity on (a text file).") + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.") + parser.add_argument("--num_sents", default=10, type=int, help="Total sentences to generate.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + + parser.add_argument("--per_gpu_train_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + + + ## Variational auto-encoder + parser.add_argument("--nz", default=32, type=int, + help="Latent space dimension.") + + parser.add_argument("--prompt", type=str, default="") + parser.add_argument("--padding_text", type=str, default="") + parser.add_argument("--length", type=int, default=20) + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + args = parser.parse_args() + + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + + set_seed(args) + + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary that you have fine-tuned + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary that you have fine-tuned + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # pdb.set_trace() + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
+ assert tokenizer_decoder.pad_token == '' + + + # Evaluation + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) + + if not os.path.exists(args.output_dir): os.makedirs(args.output_dir) + args.output_generation_file = os.path.join(args.output_dir, f"generation_from_vae_prior_t{args.temperature}_p{args.top_p}.txt") + # args.output_generation_file = args.train_data_file + result = evaluate_generation_fromp_prior(model_vae, tokenizer_decoder, args) + + + bleu5 = Bleu(test_text= args.output_generation_file, + real_text=args.eval_data_file, + num_real_sentences=args.num_sents, + num_fake_sentences=args.num_sents, + gram=5).get_score() + logger.info(f'The bleu score is {bleu5}') + + sbleu5 = SelfBleu(test_text= args.output_generation_file, + num_sentences=args.num_sents, + gram=5).get_score() + logger.info(f'The self-bleu score is {sbleu5}') + + args.eval_results_file = os.path.join(args.output_dir, f"eval_results_t{args.temperature}_p{args.top_p}.txt") + eval_results = {'bleu5':bleu5 , 'sbleu5':sbleu5} + with open(args.eval_results_file, "w") as writer: + logger.info("***** SHOW the quantative evalution results *****") + for key in sorted(eval_results.keys()): + writer.write("%s %s" % (key, str(eval_results[key])) ) + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/big_ae/run_gpt2_generation.py b/Optimus/code/examples/big_ae/run_gpt2_generation.py new file mode 100755 index 0000000000000000000000000000000000000000..3d8816e8abcea925891e599c323201cc152f2e13 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_gpt2_generation.py @@ -0,0 +1,390 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
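+# Baseline generation: samples sentences from a fine-tuned GPT-2 decoder alone (no latent code is
+# injected during sampling) and reports BLEU / Self-BLEU, for comparison with generation from the
+# Optimus prior.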
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import glob +import logging +import os +import pickle +import random + + +cwd = os.getcwd() +print(f"Current working dir is {cwd}") + +import sys +sys.path.append('./') +pt_path = os.path.join( cwd, 'pytorch_transformers') +sys.path.append(pt_path) +print(f"Pytorch Transformer {pt_path}") + +import torch +import torch.nn.functional as F +import numpy as np + +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + + +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer +from pytorch_transformers import BertForLatentConnector, BertTokenizer + +import pytorch_transformers + +from collections import defaultdict +from modules import VAE +from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader) +from metrics import Bleu, SelfBleu + + + +import pdb + + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ()) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer) +} + +# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia +# in https://github.com/rusiaaman/XLNet-gen#methodology +# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e +PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family +(except for Alexei and Maria) are discovered. +The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the +remainder of the story. 1883 Western Siberia, +a young Grigori Rasputin is asked by his father and a group of men to perform magic. +Rasputin has a vision and denounces one of the men as a horse thief. Although his +father initially slaps him for making such an accusation, Rasputin watches as the +man is chased outside and beaten. Twenty years later, Rasputin sees a vision of +the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, +with people, even a bishop, begging for his blessing. 
""" + + +def set_seed(args): + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False) + else: + pass + return dataloader + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + # top-k + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + # top-p + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + +def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu', decoder_tokenizer=None, max_seq_length=-1): + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + gen_seq_length = 0 + with torch.no_grad(): + while True: + + inputs = {'input_ids': generated} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + gen_seq_length += 1 + if next_token.unsqueeze(0)[0,0].item() == 
decoder_tokenizer.encode('')[0]: + break + if max_seq_length>0 and gen_seq_length>max_seq_length: + break + + + return generated + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None, max_seq_length=-1): + + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + gen_seq_length = 0 + with torch.no_grad(): + while True: + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + gen_seq_length += 1 + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0]: + break + if max_seq_length>0 and gen_seq_length>max_seq_length: + break + + return generated + + +def evaluate_generation_from_gpt2(model, decoder_tokenizer, args, ns=1): + + loc = torch.zeros([args.nz]).to(args.device) + scale = torch.ones([args.nz]).to(args.device) + prior = torch.distributions.normal.Normal(loc, scale) + + context_tokens = decoder_tokenizer.encode('') + + count = 0 + result = defaultdict(str) + for i in tqdm(range(args.num_sents)): + + with torch.no_grad(): + + out = sample_sequence( + model=model, + context=context_tokens, + length=args.max_seq_length, # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer, + max_seq_length = args.max_seq_length + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + '\n' + result[i] = text_x1 + + if args.use_philly: + print("PROGRESS: {}%".format( round(100 * i /args.num_sents , 4))) + + with open(args.output_generation_file, "w") as writer: + logger.info("***** SHOW generated sentences from prior *****") + for key in sorted(result.keys()): + # logger.info(" %s \n %s", key, str(result[key])) + # writer.write("%s \n %s\n" % (key, str(result[key]))) + writer.write("%s" % str(result[key])) + + return result + + +# bleu = evaluate_bleu(results, args) + + + + + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--eval_data_file", default=None, type=str, + help="An input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.") + 
parser.add_argument("--num_sents", default=10, type=int, help="Total sentences to generate.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + + parser.add_argument("--per_gpu_train_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + + + ## Variational auto-encoder + parser.add_argument("--nz", default=32, type=int, + help="Latent space dimension.") + + parser.add_argument("--prompt", type=str, default="") + parser.add_argument("--padding_text", type=str, default="") + parser.add_argument("--length", type=int, default=20) + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + args = parser.parse_args() + + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + + set_seed(args) + args.decoder_model_type = args.decoder_model_type.lower() + + + global_step = args.gloabl_step_eval + + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-{}'.format(global_step)) + checkpoints = [ output_decoder_dir ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Decoder model and vocabulary that you have fine-tuned + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # pdb.set_trace() + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
+ assert tokenizer_decoder.pad_token == '' + + + # Evaluation + if not os.path.exists(args.output_dir): os.makedirs(args.output_dir) + args.output_generation_file = os.path.join(args.output_dir, f"generation_from_gpt2_t{args.temperature}_p{args.top_p}.txt") + # args.output_generation_file = args.train_data_file + result = evaluate_generation_from_gpt2(model_decoder, tokenizer_decoder, args) + + bleu5 = Bleu(test_text= args.output_generation_file, + real_text=args.eval_data_file, + num_real_sentences=args.num_sents, + num_fake_sentences=args.num_sents, + gram=5).get_score() + logger.info(f'The bleu score is {bleu5}') + + sbleu5 = SelfBleu(test_text= args.output_generation_file, + num_sentences=args.num_sents, + gram=5).get_score() + logger.info(f'The self-bleu score is {sbleu5}') + + args.eval_results_file = os.path.join(args.output_dir, f"eval_results_t{args.temperature}_p{args.top_p}.txt") + eval_results = {'bleu5':bleu5 , 'sbleu5':sbleu5} + with open(args.eval_results_file, "w") as writer: + logger.info("***** SHOW the quantative evalution results *****") + for key in sorted(eval_results.keys()): + writer.write("%s %s" % (key, str(eval_results[key])) ) + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/big_ae/run_latent_generation.py b/Optimus/code/examples/big_ae/run_latent_generation.py new file mode 100755 index 0000000000000000000000000000000000000000..890ad7c76c3a7048727edb69e30a91b6254b9f53 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_latent_generation.py @@ -0,0 +1,577 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
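+# Latent-space manipulation with Optimus: sentence reconstruction, interpolation between two
+# sentences, and analogy via vector arithmetic on latent codes. A hypothetical invocation with
+# placeholder paths (all flags used here are defined in main() below):
+#   python examples/big_ae/run_latent_generation.py --checkpoint_dir <ckpt_dir> --gloabl_step_eval <step> \
+#     --train_data_file <train.txt> --output_dir <out_dir> --play_mode interpolation \
+#     --interact_with_user_input --sent_source "a source sentence" --sent_target "a target sentence"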
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import glob +import logging +import os +import pickle +import random + + +import torch +import torch.nn.functional as F +import numpy as np + +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + + +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, BertConfig +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2ForLatentConnector +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer +from pytorch_transformers import BertForLatentConnector, BertTokenizer + +from collections import defaultdict +from modules import VAE +from utils import (TextDataset_Split, TextDataset_2Tokenizers, BucketingDataLoader) + + +import pdb + + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ()) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer) +} + +# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia +# in https://github.com/rusiaaman/XLNet-gen#methodology +# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e +PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family +(except for Alexei and Maria) are discovered. +The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the +remainder of the story. 1883 Western Siberia, +a young Grigori Rasputin is asked by his father and a group of men to perform magic. +Rasputin has a vision and denounces one of the men as a horse thief. Although his +father initially slaps him for making such an accusation, Rasputin watches as the +man is chased outside and beaten. Twenty years later, Rasputin sees a vision of +the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, +with people, even a bishop, begging for his blessing. 
""" + + +def set_seed(args): + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False) + else: + pass + return dataloader + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + +def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'): + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + for _ in trange(length): + + inputs = {'input_ids': generated} + if is_xlnet: + # XLNet is a direct (predict same token, not next token) and bi-directional model by default + # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring) + input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1) + perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = torch.zeros((1, 1, 
input_ids.shape[1]), dtype=torch.float, device=device) + target_mapping[0, 0, -1] = 1.0 # predict last token + inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping} + + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + return generated + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None): + + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + while True: + # for _ in trange(length): + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0]: + break + + return generated + + +def latent_code_from_text(text, tokenizer_encoder, model_vae, args): + tokenized1 = tokenizer_encoder.encode(text) + tokenized1 = [101] + tokenized1 + [102] + coded1 = torch.Tensor([tokenized1]) + coded1 =torch.Tensor.long(coded1) + with torch.no_grad(): + x0 = coded1 + x0 = x0.to(args.device) + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + coded_length = len(tokenized1) + return latent_z, coded_length + +def text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder): + past = latent_z + context_tokens = tokenizer_decoder.encode('') + + length = 128 # maximum length, but not used + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length= length, # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = tokenizer_decoder + ) + text_x1 = tokenizer_decoder.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + return text_x1 + + +# a wrapper function to choose between different play modes +def evaluate_latent_space(args, model_vae, encoder_tokenizer, decoder_tokenizer, prefix=""): + + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + + # Eval! 
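+    # --play_mode selects between reconstruction (decode each sentence from its posterior mean)
+    # and interpolation (decode points on the line between two latent codes drawn from the eval set).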
+ logger.info("***** Running recontruction evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.per_gpu_eval_batch_size) + + model_vae.eval() + + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + if args.play_mode == 'reconstrction': + result = calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100) + result_file_name = "eval_recontruction_results.txt" + elif args.play_mode == 'interpolation': + result = calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=100) + result_file_name = "eval_interpolation_results.txt" + else: + logger.info("Please specify the corrent play mode [reconstrction, interpolation]") + + + eval_output_dir = args.output_dir + output_eval_file = os.path.join(eval_output_dir, result_file_name) + + with open(output_eval_file, "w") as writer: + logger.info("***** Eval {} results *****".format(args.play_mode)) + for key in sorted(result.keys()): + logger.info(" %s \n %s", key, str(result[key])) + writer.write("%s \n %s\n" % (key, str(result[key]))) + + return result + + +def calc_rec(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1): + + count = 0 + result = defaultdict(str) + for batch in tqdm(eval_dataloader, desc="Evaluating recontruction"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x1 = x1[:,:max_len_values[1]] + + x0 = x0.to(args.device) + x1 = x1.to(args.device) + x_lengths = x_lengths.to(args.device) + + context_tokens = decoder_tokenizer.encode('') + + with torch.no_grad(): + + text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0] + # result["INPUT TEXT " + str(count)].append(text_x0) + + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + + # Connect hidden feature to the latent space + # latent_z, loss_kl = model_vae.connect(pooled_hidden_fea) + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + + past = latent_z + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length=x_lengths[0,1], # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + '\n' + result[text_x0] = text_x1 + + count += 1 + if count>args.total_sents: + break + + + return result + + + + +def calc_interpolate(model_vae, eval_dataloader, encoder_tokenizer, decoder_tokenizer, args, ns=1): + + count = 0 + latent_codes = [] + sample_interval = 0 + for batch in tqdm(eval_dataloader, desc="Evaluating interpolation"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x0 = x0.to(args.device) + x_lengths = x_lengths.to(args.device) + + + with torch.no_grad(): + if sample_interval == 0 or sample_interval == args.total_sents: + text_x0 = encoder_tokenizer.decode(x0[0,:x_lengths[0,0]].tolist(), clean_up_tokenization_spaces=True)[0] + pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1] + + # Connect hidden feature to 
the latent space + mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = mean.squeeze(1) + + latent_codes.append(latent_z) + + if sample_interval == 5: + latent_codes.append(latent_z) + sample_interval = 0 + continue + else: + sample_interval += 1 + continue + + count += 1 + if count>args.total_sents: + break + + context_tokens = decoder_tokenizer.encode('') + result = defaultdict(str) + latent_codes_interpolation = [] + num_steps = args.num_interpolation_steps + for step in range(num_steps+1): + latent_z = latent_codes[0] + (latent_codes[1] - latent_codes[0]) * step * 1.0/num_steps + + past = latent_z + out = sample_sequence_conditional( + model=model_vae.decoder, + context=context_tokens, + past=past, + length=x_lengths[0,1], # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer + ) + text_x1 = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=True) + text_x1 = text_x1.split()[1:-1] + text_x1 = ' '.join(text_x1) + result[step] = text_x1 + + return result + + +def interpolate(model_vae, tokenizer_encoder, tokenizer_decoder, args): + # and then in the main function + latent_z1, coded_length1 = latent_code_from_text(args.sent_source, tokenizer_encoder, model_vae, args) + latent_z2, coded_length2 = latent_code_from_text(args.sent_target, tokenizer_encoder, model_vae, args) + + result = defaultdict(str) + + num_steps = args.num_interpolation_steps + 1 + for step in range(num_steps+1): + latent_z = latent_z1 + (latent_z2 - latent_z1) * step * 1.0/num_steps + + text_interpolate = text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder) + result[step] = text_interpolate + print(text_interpolate) + + return result + + +def analogy(model_vae, tokenizer_encoder, tokenizer_decoder, args): + + latent_z1, coded_length1 = latent_code_from_text(args.sent_source, tokenizer_encoder, model_vae, args) + latent_z2, coded_length2 = latent_code_from_text(args.sent_target, tokenizer_encoder, model_vae, args) + latent_z3, coded_length3 = latent_code_from_text(args.sent_input, tokenizer_encoder, model_vae, args) + + result = defaultdict(str) + + latent_z = latent_z3 + args.degree_to_target * (latent_z2 - latent_z1) + + text_analogy = text_from_latent_code(latent_z, model_vae, args, tokenizer_decoder) + result[0] = text_analogy + print(text_analogy) + + return result + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--eval_data_file", default=None, type=str, + help="An input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default='Snli', type=str, help="The dataset.") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.") + parser.add_argument("--num_interpolation_steps", default=10, type=int, help="Total sentences to test 
recontruction.") + parser.add_argument("--play_mode", default="interpolation", type=str, + help="interpolation or reconstruction.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + + parser.add_argument("--per_gpu_train_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + + # Interact with users + parser.add_argument("--interact_with_user_input", action='store_true', help="Use user input to interact_with.") + parser.add_argument("--sent_source", type=str, default="") + parser.add_argument("--sent_target", type=str, default="") + parser.add_argument("--sent_input", type=str, default="") + parser.add_argument("--degree_to_target", type=float, default="1.0") + + ## Variational auto-encoder + parser.add_argument("--nz", default=32, type=int, + help="Latent space dimension.") + + parser.add_argument("--prompt", type=str, default="") + parser.add_argument("--padding_text", type=str, default="") + parser.add_argument("--length", type=int, default=20) + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=1.0) + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + args = parser.parse_args() + + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + + set_seed(args) + + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary that you have fine-tuned + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary that you have fine-tuned + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # Load full model + output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step)) + checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin')) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
+ assert tokenizer_decoder.pad_token == '' + + + # Evaluation + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args) + model_vae.load_state_dict(checkpoint['model_state_dict']) + logger.info("Pre-trained Optimus is successfully loaded") + model_vae.to(args.device) + + if args.interact_with_user_input: + + if args.play_mode == 'interpolation': + if len(args.sent_source) > 0 and len(args.sent_source) > 0: + result = interpolate(model_vae, tokenizer_encoder, tokenizer_decoder, args) + else: + print('Please check: specify the source and target sentences!') + + if args.play_mode == 'analogy': + if len(args.sent_source) > 0 and len(args.sent_source) > 0 and len(args.sent_input) > 0: + result = analogy(model_vae, tokenizer_encoder, tokenizer_decoder, args) + else: + print('Please check: specify the source, target and input analogy sentences!') + + + else: + result = evaluate_latent_space(args, model_vae, tokenizer_encoder, tokenizer_decoder, prefix=global_step) + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/big_ae/run_lm_ae_pretraining.py b/Optimus/code/examples/big_ae/run_lm_ae_pretraining.py new file mode 100755 index 0000000000000000000000000000000000000000..13dc0014d77e5f46846caf53c1db529932d092a3 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_ae_pretraining.py @@ -0,0 +1,692 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertModel, BertTokenizer, + GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertModel, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +class TextDataset(Dataset): + def __init__(self, tokenizer, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}') + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + self.examples = [] + with open(file_path, encoding="utf-8") as f: + text = f.read() + + + tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) + + while len(tokenized_text) >= block_size: # Truncate in block of block_size + self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size])) + tokenized_text = tokenized_text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return torch.tensor(self.examples[item]) + + + +class TextDataset_2Tokenizers(Dataset): + def __init__(self, tokenizers, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}') + + + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + + with open(file_path, encoding="utf-8") as f: + text = f.read() + + # pdb.set_trace() + self.examples = [] + # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied + while len(text) >= block_size: # Truncate in block of block_size + + tokenized_text0 = tokenizers[0].convert_tokens_to_ids(tokenizers[0].tokenize(text[:block_size])) + tokenized_text0 = tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0] + tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length. + assert len(tokenized_text0) == block_size + + tokenized_text1 = tokenizers[1].convert_tokens_to_ids(tokenizers[1].tokenize(text[:block_size])) + tokenized_text1 = tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1_length = len(tokenized_text1) + pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + tokenized_text1 = tokenized_text1 + ([pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length. + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length]) + + text = text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + # pdb.set_trace() + # Convert to Tensors and build dataset + tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long) + tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long) + tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long) + # pdb.set_trace() + return (tokenized_text0, tokenized_text1, tokenized_text_lengths) + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataset, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_encoder_parameters = [ + {'params': [p for n, p in model_encoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p 
for n, p in model_encoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer_grouped_decoder_parameters = [ + {'params': [p for n, p in model_decoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_decoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + + optimizer_encoder = AdamW(optimizer_grouped_encoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + optimizer_decoder = AdamW(optimizer_grouped_decoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler_encoder = WarmupLinearSchedule(optimizer_encoder, warmup_steps=args.warmup_steps, t_total=t_total) + scheduler_decoder = WarmupLinearSchedule(optimizer_decoder, warmup_steps=args.warmup_steps, t_total=t_total) + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_encoder, optimizer_encoder = amp.initialize(model_encoder, optimizer_encoder, opt_level=args.fp16_opt_level) + model_decoder, optimizer_decoder = amp.initialize(model_decoder, optimizer_decoder, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_encoder = torch.nn.DataParallel(model_encoder) + model_decoder = torch.nn.DataParallel(model_decoder) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_encoder = torch.nn.parallel.DistributedDataParallel(model_encoder, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + model_decoder = torch.nn.parallel.DistributedDataParallel(model_decoder, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model_encoder.zero_grad() + model_decoder.zero_grad() + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_encoder.train() + model_decoder.train() + + + # Encoding + outputs = model_encoder(inputs) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + + # Decoding + outputs = model_decoder(input_ids=tokenized_text1, past=pooled_hidden_fea, labels=labels) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_encoder), args.max_grad_norm) + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_decoder), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_encoder.parameters(), args.max_grad_norm) + torch.nn.utils.clip_grad_norm_(model_decoder.parameters(), args.max_grad_norm) + optimizer_encoder.step() + optimizer_decoder.step() + scheduler_encoder.step() # Update learning rate schedule + scheduler_decoder.step() + model_encoder.zero_grad() + model_decoder.zero_grad() + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr_encoder', scheduler_encoder.get_lr()[0], global_step) + tb_writer.add_scalar('lr_decoder', scheduler_decoder.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 
'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training + model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training + + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + + logger.info("Saving model checkpoint to %s", output_encoder_dir) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! 
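
The `evaluate` function that follows averages the per-batch language-modeling loss and reports its exponential as perplexity. A short, self-contained sketch of that reduction with hypothetical batch losses:

```
import torch

# Hypothetical mean cross-entropy (per token) for each evaluation batch.
batch_losses = [3.21, 3.05, 2.98, 3.10]

eval_loss = sum(batch_losses) / len(batch_losses)   # eval_loss / nb_eval_steps in the code
perplexity = torch.exp(torch.tensor(eval_loss))     # exp of the average NLL
print('eval_loss = %.3f, perplexity = %.2f' % (eval_loss, perplexity.item()))
```
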
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + model_encoder.eval() + model_decoder.eval() + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + # pdb.set_trace() + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # prepare input-output data for evaluation + inputs, labels = tokenized_text0, tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + with torch.no_grad(): + # Encoding + outputs = model_encoder(inputs) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + # Decoding + outputs = model_decoder(input_ids=tokenized_text1, past=pooled_hidden_fea, labels=labels) + lm_loss = outputs[0] + + eval_loss += lm_loss.mean().item() + nb_eval_steps += 1 + + eval_loss = eval_loss / nb_eval_steps + perplexity = torch.exp(torch.tensor(eval_loss)) + + result = { + "perplexity": perplexity + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to 
mask for masked language modeling loss") + + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." 
+ "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config) + model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = 
MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + model_decoder.to(args.device) + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataset, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training + model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + + + # Load a trained model and vocabulary that you have fine-tuned + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(output_encoder_dir, do_lower_case=args.do_lower_case) + model_encoder.to(args.device) + + # Load a trained model and vocabulary that you have fine-tuned + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(output_decoder_dir, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + global_step= 881 + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else "" + + model_encoder = encoder_model_class.from_pretrained(checkpoint[0]) + model_encoder.to(args.device) + model_decoder = decoder_model_class.from_pretrained(checkpoint[1]) + model_decoder.to(args.device) + result = evaluate(args, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_causal_pretraining.py b/Optimus/code/examples/big_ae/run_lm_causal_pretraining.py new file mode 100755 index 0000000000000000000000000000000000000000..f8350521228688068947a1899a5f0f9b95fa3749 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_causal_pretraining.py @@ -0,0 +1,692 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). 
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertModel, BertTokenizer, + GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertModel, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +class TextDataset(Dataset): + def __init__(self, tokenizer, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}') + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + self.examples = [] + with open(file_path, encoding="utf-8") as f: + text = f.read() + + + tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) + + while len(tokenized_text) >= block_size: # Truncate in block of block_size + self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size])) + tokenized_text = tokenized_text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
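
As in the autoencoder script, `main()` later resolves the encoder and decoder from the `MODEL_CLASSES` registry defined above, where each entry bundles a config, model, and tokenizer class. A minimal sketch of that lookup, restated with a local two-entry registry so it runs on its own (the BERT name is the script default; `gpt2` is an assumed decoder choice):

```
from pytorch_transformers import (BertConfig, BertModel, BertTokenizer,
                                  GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)

# Local mirror of the bert/gpt2 entries in MODEL_CLASSES above.
REGISTRY = {
    'bert': (BertConfig, BertModel, BertTokenizer),
    'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
}

enc_config_cls, enc_model_cls, enc_tok_cls = REGISTRY['bert']
dec_config_cls, dec_model_cls, dec_tok_cls = REGISTRY['gpt2']

tokenizer_encoder = enc_tok_cls.from_pretrained('bert-base-cased')
tokenizer_decoder = dec_tok_cls.from_pretrained('gpt2')
model_encoder = enc_model_cls.from_pretrained('bert-base-cased')
model_decoder = dec_model_cls.from_pretrained('gpt2')
```
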
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return torch.tensor(self.examples[item]) + + + +class TextDataset_2Tokenizers(Dataset): + def __init__(self, tokenizers, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}') + + + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + + with open(file_path, encoding="utf-8") as f: + text = f.read() + + # pdb.set_trace() + self.examples = [] + # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied + while len(text) >= block_size: # Truncate in block of block_size + + tokenized_text0 = tokenizers[0].convert_tokens_to_ids(tokenizers[0].tokenize(text[:block_size])) + tokenized_text0 = tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0] + tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length. + assert len(tokenized_text0) == block_size + + tokenized_text1 = tokenizers[1].convert_tokens_to_ids(tokenizers[1].tokenize(text[:block_size])) + tokenized_text1 = tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1_length = len(tokenized_text1) + pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + tokenized_text1 = tokenized_text1 + ([pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length. + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length]) + + text = text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + # pdb.set_trace() + # Convert to Tensors and build dataset + tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long) + tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long) + tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long) + # pdb.set_trace() + return (tokenized_text0, tokenized_text1, tokenized_text_lengths) + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataset, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_encoder_parameters = [ + {'params': [p for n, p in model_encoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p 
for n, p in model_encoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer_grouped_decoder_parameters = [ + {'params': [p for n, p in model_decoder.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_decoder.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + + optimizer_encoder = AdamW(optimizer_grouped_encoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + optimizer_decoder = AdamW(optimizer_grouped_decoder_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler_encoder = WarmupLinearSchedule(optimizer_encoder, warmup_steps=args.warmup_steps, t_total=t_total) + scheduler_decoder = WarmupLinearSchedule(optimizer_decoder, warmup_steps=args.warmup_steps, t_total=t_total) + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_encoder, optimizer_encoder = amp.initialize(model_encoder, optimizer_encoder, opt_level=args.fp16_opt_level) + model_decoder, optimizer_decoder = amp.initialize(model_decoder, optimizer_decoder, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_encoder = torch.nn.DataParallel(model_encoder) + model_decoder = torch.nn.DataParallel(model_decoder) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_encoder = torch.nn.parallel.DistributedDataParallel(model_encoder, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + model_decoder = torch.nn.parallel.DistributedDataParallel(model_decoder, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model_encoder.zero_grad() + model_decoder.zero_grad() + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_encoder.train() + model_decoder.train() + + + # Encoding + outputs = model_encoder(inputs) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + + # Decoding + outputs = model_decoder(input_ids=tokenized_text1, past=None, labels=labels) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_encoder), args.max_grad_norm) + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer_decoder), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_encoder.parameters(), args.max_grad_norm) + torch.nn.utils.clip_grad_norm_(model_decoder.parameters(), args.max_grad_norm) + optimizer_encoder.step() + optimizer_decoder.step() + scheduler_encoder.step() # Update learning rate schedule + scheduler_decoder.step() + model_encoder.zero_grad() + model_decoder.zero_grad() + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr_encoder', scheduler_encoder.get_lr()[0], global_step) + tb_writer.add_scalar('lr_decoder', scheduler_decoder.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 
'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training + model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training + + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + + logger.info("Saving model checkpoint to %s", output_encoder_dir) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model_encoder, model_decoder, encoder_tokenizer, decoder_tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! 
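
The checkpointing block above writes the encoder and decoder to separate `checkpoint-encoder-{step}` / `checkpoint-decoder-{step}` folders via `save_pretrained`, alongside the argparse namespace. A sketch of that save-and-reload round trip under an assumed `output/` directory and an arbitrary step number (the names and values here are illustrative, not taken from the script):

```
import os
import torch
from argparse import Namespace
from pytorch_transformers import BertModel, GPT2LMHeadModel

args = Namespace(learning_rate=5e-5, block_size=512)   # stand-in for the parsed arguments
model_encoder = BertModel.from_pretrained('bert-base-cased')
model_decoder = GPT2LMHeadModel.from_pretrained('gpt2')

global_step = 50                                       # arbitrary step for illustration
enc_dir = os.path.join('output', 'checkpoint-encoder-{}'.format(global_step))
dec_dir = os.path.join('output', 'checkpoint-decoder-{}'.format(global_step))
os.makedirs(enc_dir, exist_ok=True)
os.makedirs(dec_dir, exist_ok=True)

# Save weights + config per module, and the training args next to them.
model_encoder.save_pretrained(enc_dir)
model_decoder.save_pretrained(dec_dir)
torch.save(args, os.path.join(enc_dir, 'training_encoder_args.bin'))
torch.save(args, os.path.join(dec_dir, 'training_decoder_args.bin'))

# Reload later, as the evaluation branch does.
model_encoder = BertModel.from_pretrained(enc_dir)
model_decoder = GPT2LMHeadModel.from_pretrained(dec_dir)
```
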
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + model_encoder.eval() + model_decoder.eval() + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + # pdb.set_trace() + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # prepare input-output data for evaluation + inputs, labels = tokenized_text0, tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + with torch.no_grad(): + # Encoding + outputs = model_encoder(inputs) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + # Decoding + outputs = model_decoder(input_ids=tokenized_text1, past=None, labels=labels) + lm_loss = outputs[0] + + eval_loss += lm_loss.mean().item() + nb_eval_steps += 1 + + eval_loss = eval_loss / nb_eval_steps + perplexity = torch.exp(torch.tensor(eval_loss)) + + result = { + "perplexity": perplexity + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for 
masked language modeling loss") + + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." 
+ "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config) + model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = 
MODEL_CLASSES[args.decoder_model_type]
+    decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path)
+    tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case)
+    if args.block_size <= 0:
+        args.block_size = tokenizer_decoder.max_len_single_sentence  # Our input block size will be the max possible for the model
+    args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence)
+    model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config)
+
+    # Chunyuan: Add Padding token to GPT2
+    special_tokens_dict = {'pad_token': '<PAD>'}
+    num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
+    print('We have added', num_added_toks, 'tokens')
+    model_decoder.resize_token_embeddings(len(tokenizer_decoder))  # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
+    assert tokenizer_decoder.pad_token == '<PAD>'
+
+    model_decoder.to(args.device)
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # End of barrier to make sure only the first process in distributed training downloads model & vocab
+
+    logger.info("Training/evaluation parameters %s", args)
+
+    global_step = 0
+    # Training
+    if args.do_train:
+        if args.local_rank not in [-1, 0]:
+            torch.distributed.barrier()  # Barrier to make sure only the first process in distributed training processes the dataset; the others will use the cache
+
+        train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
+
+        if args.local_rank == 0:
+            torch.distributed.barrier()
+
+        global_step, tr_loss = train(args, train_dataset, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+
+    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        # Save model checkpoint
+        output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
+        output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
+        if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(output_encoder_dir)
+        if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(output_decoder_dir)
+
+        logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
+        logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_encoder.module if hasattr(model_encoder, 'module') else model_encoder # Take care of distributed/parallel training + model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + + + # Load a trained model and vocabulary that you have fine-tuned + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(output_encoder_dir, do_lower_case=args.do_lower_case) + model_encoder.to(args.device) + + # Load a trained model and vocabulary that you have fine-tuned + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(output_decoder_dir, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + global_step= 881 + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else "" + + model_encoder = encoder_model_class.from_pretrained(checkpoint[0]) + model_encoder.to(args.device) + model_decoder = decoder_model_class.from_pretrained(checkpoint[1]) + model_decoder.to(args.device) + result = evaluate(args, model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_finetuning_baseline.py b/Optimus/code/examples/big_ae/run_lm_finetuning_baseline.py new file mode 100755 index 0000000000000000000000000000000000000000..ea749dd47aaef984f1daf38bdcc4df8ce6739978 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_finetuning_baseline.py @@ -0,0 +1,573 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). 
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + +import pdb + +import sys +sys.path.insert(0, '.') + +import argparse +import glob +import logging +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForMaskedLM, BertTokenizer, + GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, TextDataset_Split, TextDataset_2Tokenizers) + +import pdb + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForMaskedLM, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +class TextDataset(Dataset): + def __init__(self, tokenizer, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}') + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + self.examples = [] + with open(file_path, encoding="utf-8") as f: + text = f.read() + + tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) + + while len(tokenized_text) >= block_size: # Truncate in block of block_size + self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size])) + tokenized_text = tokenized_text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
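+            # The block-sized token id lists built above are pickled next to the source file below,
+            # so subsequent runs can load this cache instead of re-tokenizing the whole corpus.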
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return torch.tensor(self.examples[item]) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to 
use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text1, tokenized_text_lengths = batch + + inputs, labels = tokenized_text1, tokenized_text1 + + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model.train() + + outputs = model(inputs, labels=labels, label_ignore=tokenizer.pad_token_id) + + # pdb.set_trace() + loss = outputs[0].mean() # model outputs are always tuple in pytorch-transformers (see doc) + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss)) + + + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + optimizer.step() + scheduler.step() # Update learning rate schedule + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = 
os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + eval_loss_sum = 0.0 + nb_eval_steps = 0 + report_num_words = 0 + + model.eval() + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + + tokenized_text1, x_lengths = batch + x_lengths = x_lengths.to(args.device) + report_num_words += x_lengths.sum().item() + + inputs, labels = tokenized_text1, tokenized_text1 + + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + + with torch.no_grad(): + outputs = model(inputs, labels=labels, label_ignore=tokenizer.pad_token_id) + lm_loss = outputs[0] + + + eval_loss += lm_loss.mean().item()/x_lengths.sum().item() + eval_loss_sum += lm_loss.sum().item() + + + nb_eval_steps += 1 + + # pdb.set_trace() + + eval_loss = eval_loss / nb_eval_steps + perplexity1 = torch.exp(torch.tensor(eval_loss)) + perplexity2 = torch.exp(torch.tensor(eval_loss_sum / report_num_words)) + + + + result = { + "perplexity1": perplexity1, "perplexity2": perplexity2 + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + + parser.add_argument("--model_type", default="bert", type=str, + 
help="The model architecture to be fine-tuned.") + parser.add_argument("--model_name_or_path", default="bert-base-cased", type=str, + help="The model checkpoint for weights initialization.") + + + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + + parser.add_argument("--config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + + + parser.add_argument('--logging_steps', type=int, default=100, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=100, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer.max_len_single_sentence) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + model.to(args.device) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model.resize_token_embeddings(len(tokenizer)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
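+    # GPT-2 ships without pad/bos/eos tokens, so the special tokens added above enlarge the
+    # vocabulary; the embedding matrix is resized to match, and the assert below just confirms
+    # the pad token was registered on the tokenizer.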
+ assert tokenizer.pad_token == '' + + + # pdb.set_trace() + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + # Training + global_step= 0 + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. + # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + model.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + + if global_step == 0: + global_step = args.gloabl_step_eval + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + print("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_gpt2_training.py b/Optimus/code/examples/big_ae/run_lm_gpt2_training.py new file mode 100755 index 0000000000000000000000000000000000000000..8a5d7fccc7401833c9f140a3b5a3dd8d0a8da61f --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_gpt2_training.py @@ -0,0 +1,658 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers) + + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language 
modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataloader, model, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + + tmp_list = [] + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + inputs, labels = tokenized_text1.to(args.device), tokenized_text1.to(args.device) + + model.train() + + outputs = model(inputs, labels=labels, label_ignore=decoder_tokenizer.pad_token_id) + loss = outputs[0].mean() # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; loss: {loss.item():.3f}; ' + ) + ) + + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + 
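+                    # When not running on Philly, save the checkpoint directly (no retry loop needed).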
model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + logger.info("***** Running evaluation on {} dataset *****".format(subset)) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.per_gpu_eval_batch_size = 1 + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + eval_loss_sum = 0.0 + nb_eval_steps = 0 + report_num_words = 0 + + model.eval() + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + + _, tokenized_text1, tokenized_text_lengths = batch + inputs, labels = tokenized_text1.to(args.device), tokenized_text1.to(args.device) + + x_lengths = tokenized_text_lengths[:,1].to(args.device) + report_num_words += x_lengths.sum().item() + + + with torch.no_grad(): + outputs = model(inputs, labels=labels, label_ignore=decoder_tokenizer.pad_token_id) + lm_loss = outputs[0] + + eval_loss += lm_loss.mean().item()/x_lengths.sum().item() + eval_loss_sum += lm_loss.sum().item() + + nb_eval_steps += 1 + + eval_loss = eval_loss / nb_eval_steps + perplexity1 = torch.exp(torch.tensor(eval_loss)) + perplexity2 = torch.exp(torch.tensor(eval_loss_sum / report_num_words)) + + + result = { + "perplexity1": perplexity1, "perplexity2": perplexity2 + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + + + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + parser.add_argument("--save_bert_gpt_init", action='store_true', + help="Use Philly for computing.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture 
to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_pretrained_model", action='store_true', + help="Use pre-trained auto-encoder models as the initialization") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + model_decoder = 
decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + model_decoder.to(args.device) + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataloader, model_decoder, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_decoder_to_save = model_decoder.module if hasattr(model_decoder, 'module') else model_decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_encoder_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir) + model_decoder.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + if global_step == 0: + global_step = args.gloabl_step_eval + + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ output_decoder_dir ] + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + + model_decoder = decoder_model_class.from_pretrained(checkpoint) + model_decoder.to(args.device) + + result = evaluate(args, model_decoder, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test') + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + # result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train') + # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + # results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_vae_label_ctrl_gen.py b/Optimus/code/examples/big_ae/run_lm_vae_label_ctrl_gen.py new file mode 100755 index 0000000000000000000000000000000000000000..9c37a80a36b8379130bad914c3723053c7c3bcaa --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_label_ctrl_gen.py @@ -0,0 +1,875 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
+""" + +from __future__ import absolute_import, division, print_function +import pdb +import argparse +import glob +import logging +import os +import pickle +import random +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime +import sys +import json +import nltk +nltk.download('punkt') + +sys.path.append('../../') +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +from utils import (TextDataset_Split, TextDataset_2Tokenizers_LCtrlG, + frange_cycle_linear, frange_cycle_zero_linear, AverageValueMeter) +# from modules import ARAE +from modules import CARA +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) +logger = logging.getLogger(__name__) +import time +def get_time_str(): + return time.ctime().replace(' ', '_').replace(':', '-') + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers_LCtrlG(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, + block_size=args.block_size, create_new=args.create_new) + else: + raise NotImplementedError + # dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + +def train(args, train_dataset, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, logff): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + # Prepare optimizer and schedule (linear warmup and decay) + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True) + # model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + # Train! 
+ logger.info("***** Running training *****") + logff.write("***** Running training *****\n") + logger.info(" Num examples = {}".format(len(train_dataset))) + logff.write(" Num examples = {}\n".format(len(train_dataset))) + logger.info(" Num Epochs = {}".format(args.num_train_epochs)) + logff.write(" Num Epochs = {}\n".format(args.num_train_epochs)) + logger.info(" Instantaneous batch size per GPU = {}".format(args.per_gpu_train_batch_size)) + logff.write(" Instantaneous batch size per GPU = {}\n".format(args.per_gpu_train_batch_size)) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logff.write(" Total train batch size (w. parallel, distributed & accumulation) = {}\n".format( + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))) + logger.info(" Gradient Accumulation steps = {}".format(args.gradient_accumulation_steps)) + logff.write(" Gradient Accumulation steps = {}\n".format(args.gradient_accumulation_steps)) + logger.info(" Total optimization steps = {}".format( t_total)) + logff.write(" Total optimization steps = {}\n".format(t_total)) + logff.flush() + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model_vae.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=1.0, stop=args.beta_cls, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + accmeter = { + 'acc_encode_z_dis': AverageValueMeter(), + 'acc_gen_z_dis': AverageValueMeter(), + 'acc_encode_z_cls': AverageValueMeter(), + 'acc_cls': AverageValueMeter(), + # 'acc_at_soft_cls': AverageValueMeter(), + } + lossmeter = { + 'loss': AverageValueMeter(), + 'loss_rec': AverageValueMeter(), + 'loss_encoder': AverageValueMeter(), + 'loss_lsc': AverageValueMeter(), + 'loss_lsd': AverageValueMeter(), + 'loss_lsg': AverageValueMeter(), + 'loss_cls': AverageValueMeter(), + # 'loss_at_soft_cls': AverageValueMeter(), + } + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + # pbar = tqdm(total=(len(train_dataloader)+1) // args.gradient_accumulation_steps) + for step, batch in enumerate(train_dataloader): + + # if step > 100: + # break + + # Data + input_seq_ids, tgt_seq_ids, tokenized_text_lengths, cond_labels = batch + max_len_values, _ = tokenized_text_lengths.max(0) + input_seq_ids = input_seq_ids[:,:max_len_values[0]] + tgt_seq_ids = tgt_seq_ids[:,:max_len_values[1]] + input_seq_ids, tgt_seq_ids = mask_tokens(input_seq_ids, encoder_tokenizer, args) if args.mlm else (input_seq_ids, tgt_seq_ids) + input_seq_ids = input_seq_ids.to(args.device) + tgt_seq_ids = tgt_seq_ids.to(args.device) + cond_labels = cond_labels.to(args.device) + input_mask = torch.where(torch.arange(max_len_values[0].item()).unsqueeze(0).repeat(input_seq_ids.size(0), 1).type_as(tokenized_text_lengths).to(args.device) + < tokenized_text_lengths[:, 0].unsqueeze(1).to(args.device), torch.ones_like(input_seq_ids), torch.zeros_like(input_seq_ids)) + + # Configs + model_vae.train() + beta_t = beta_t_list[step + epoch*len(epoch_iterator)] + model_vae.module.args.beta_cls 
= beta_t + # if beta_t == 0.0: + # model_vae.args.fb_mode = 0 + # else: + # model_vae.args.fb_mode = 1 + # if args.use_deterministic_connect: + # model_vae.args.fb_mode = 2 + + # Model + loss_dict, acc_dict = model_vae(input_seq_ids=input_seq_ids, tgt_seq_ids=tgt_seq_ids, cond_labels=cond_labels, attention_mask=input_mask) + + # Loss + for key, value in loss_dict.items(): + loss_dict[key] = value.mean() + + loss = loss_dict['loss'] + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + tr_loss += loss.item() + + # Log + for key, value in loss_dict.items(): + lossmeter[key].add(value.item()) + + for key, value in acc_dict.items(): + value = value.cpu().tolist() + for v in value: + accmeter[key].add(float(v)) + + # Optimize + if (step + 1) % args.gradient_accumulation_steps == 0: + # Optimize + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + optimizer.step() + scheduler.step() # Update learning rate schedule + model_vae.zero_grad() + global_step += 1 + # pbar.update(1) + + # Log + if global_step % args.logging_steps == 0: + logger.info("\n") + logger.info("global_step: {}, avg loss: {:3f}".format(global_step, tr_loss/global_step)) + logff.write("global_step: {}, avg loss: {:3f}\n".format(global_step, tr_loss/global_step)) + logger.info("loss: {}".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in lossmeter.items()))) + logff.write("loss: {}\n".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in lossmeter.items()))) + logger.info("acc: {}".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in accmeter.items()))) + logff.write("acc: {}\n".format(', '.join(key + ': ' + str(round(meter.mean, 3)) for key, meter in accmeter.items()))) + logff.flush() + + + if args.use_philly: + #if args.local_rank in [-1, 0]: + if args.logging_steps > 0 and global_step % args.logging_steps == 0: + logger.info("PROGRESS: {}%".format(round(100 * (step + epoch*len(train_dataloader) ) /(int(args.num_train_epochs) * len(train_dataloader)) , 4))) + logger.info("EVALERR: {}%".format(tr_loss / global_step)) + + + if args.local_rank in [-1, 0] and args.eval_steps > 0 and global_step % args.eval_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, epoch=epoch) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.eval_steps, global_step) + logging_loss = tr_loss + + # Save checkpoints + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + 
try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + break + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test", epoch=None): + + eval_output_dir = args.output_dir + + if subset == 'test': + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + elif subset == 'train': + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + else: + raise ValueError + + args.label_size = len(eval_dataset.get_labels()) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! 
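+    # The evaluation loop below runs CARA in inference mode and accumulates, per batch:
+    # reconstructions ('generated'), attribute-transferred generations ('at_generated'),
+    # label-conditional generations ('cg_generated'), and the classifier predictions
+    # that feed the accuracy metrics computed further down.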
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + logger.info(" Num steps = %d", len(eval_dataset) // args.eval_batch_size) + logger.info(" eval_output_dir = %s", eval_output_dir) + + model_vae.eval() + model_vae_module = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + outputs = { + 'sampled_cond_labels': None, + 'cond_labels': None, + 'tgt_seq_ids': None, + 'generated': None, + 'at_generated': None, + 'cg_generated': None, + 'pred_cls': None, + 'pred_ge_cls': None, + 'pred_at_cls': None, + 'pred_cg_cls': None, + } + + for bi, batch in enumerate(tqdm(eval_dataloader, desc="#Sentences", disable=args.local_rank not in [-1, 0]) ): + # if bi == 3: + # break + + # Data + input_seq_ids, tgt_seq_ids, tokenized_text_lengths, cond_labels = batch + max_len_values, _ = tokenized_text_lengths.max(0) + input_seq_ids = input_seq_ids[:,:max_len_values[0]] + tgt_seq_ids = tgt_seq_ids[:,:max_len_values[1]] + input_seq_ids = input_seq_ids.to(args.device) + tgt_seq_ids = tgt_seq_ids.to(args.device) + cond_labels = cond_labels.to(args.device) + input_mask = torch.where(torch.arange(max_len_values[0].item()).unsqueeze(0).repeat(input_seq_ids.size(0), 1).type_as(tokenized_text_lengths).to(args.device) + < tokenized_text_lengths[:, 0].unsqueeze(1).to(args.device), torch.ones_like(input_seq_ids), torch.zeros_like(input_seq_ids)) + + # Model + with torch.no_grad(): + result = model_vae(input_seq_ids=input_seq_ids, tgt_seq_ids=tgt_seq_ids, cond_labels=cond_labels, attention_mask=input_mask) + if bi == 0: + for key in outputs.keys(): + outputs[key] = result[key].cpu().tolist() + else: + for key in outputs.keys(): + outputs[key].extend(result[key].cpu().tolist()) + + # compute accuracies and store in results + acc = np.mean(np.array(np.array(outputs['pred_cls']) == np.array(outputs['cond_labels']), dtype=np.float)) + acc_ge = np.mean(np.array(np.array(outputs['pred_ge_cls']) == np.array(outputs['cond_labels']), dtype=np.float)) + acc_at = np.mean(np.array(np.array(outputs['pred_at_cls']) == np.array(outputs['sampled_cond_labels']), dtype=np.float)) + acc_cg = np.mean(np.array(np.array(outputs['pred_cg_cls']) == np.array(outputs['sampled_cond_labels']), dtype=np.float)) + metrics = {'acc': acc, 'acc_ge': acc_ge, 'acc_at': acc_at, 'acc_cg': acc_cg} + + # dump generated outputs to file. 
+ json.dump(outputs, open(os.path.join(eval_output_dir, "outputs_{}.json".format(epoch) if epoch is not None else "outputs.json"), 'w')) + + # compute BLEU + bos_token_id = model_vae_module.tokenizer_decoder.encode('')[0] + eos_token_id = model_vae_module.tokenizer_decoder.encode('')[0] + pad_token_id = model_vae_module.tokenizer_decoder.encode('')[0] + + generated_ids = [] + generated_text = [] + for g in outputs['generated']: + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + g = g[:g.index(eos_token_id)] if eos_token_id in g else g + g = g[:g.index(pad_token_id)] if pad_token_id in g else g + g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True) + generated_ids.append(g) + generated_text.append(g_text) + + tgt_seq_ids = [] + tgt_seq_text = [] + for g in outputs['tgt_seq_ids']: + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + g = g[:g.index(eos_token_id)] if eos_token_id in g else g + g = g[:g.index(pad_token_id)] if pad_token_id in g else g + g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True) + tgt_seq_ids.append(g) + tgt_seq_text.append(g_text) + + at_generated_ids = [] + at_generated_text = [] + for g in outputs['at_generated']: + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + g = g[:g.index(eos_token_id)] if eos_token_id in g else g + g = g[:g.index(pad_token_id)] if pad_token_id in g else g + g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True) + at_generated_ids.append(g) + at_generated_text.append(g_text) + + cg_generated_ids = [] + cg_generated_text = [] + for g in outputs['cg_generated']: + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + if g and g[0] in [eos_token_id, bos_token_id]: + g = g[1:] + g = g[:g.index(eos_token_id)] if eos_token_id in g else g + g = g[:g.index(pad_token_id)] if pad_token_id in g else g + g_text = model_vae_module.tokenizer_decoder.decode(g, clean_up_tokenization_spaces=True) + cg_generated_ids.append(g) + cg_generated_text.append(g_text) + + f = open(os.path.join(eval_output_dir, "reconstruction{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w') + f.write('\n'.join([g + '\n' + t for g, t in zip(generated_text, tgt_seq_text)])) + fat = open(os.path.join(eval_output_dir, "attribute_transfer{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w') + fat.write('\n'.join([g + '\n' + t for g, t in zip(at_generated_text, tgt_seq_text)])) + fcg = open(os.path.join(eval_output_dir, "conditional_generation{}.txt".format(('_'+str(epoch)) if epoch is not None else '')), 'w') + fcg.write('\n'.join(cg_generated_text)) + + rec_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t)] for t in tgt_seq_text], + hypotheses=[nltk.word_tokenize(g) for g in generated_text]) + + at_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t)] for t in tgt_seq_text], + hypotheses=[nltk.word_tokenize(g) for g in at_generated_text]) + + cg_generated_text_subset = cg_generated_text[:500] # use a subset, otherwise it takes a long time to compute. 
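+    # Corpus BLEU of the conditional generations against the full reference set gauges
+    # how fluent / on-distribution they are, while self-BLEU (each sample scored against
+    # the remaining generations) gauges diversity: lower self-BLEU means less repetitive
+    # samples. The 500-sample subset above keeps the pairwise tokenization and BLEU cost
+    # manageable.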
+ cg_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t) for t in tgt_seq_text] for _ in range(len(cg_generated_text_subset))], + hypotheses=[nltk.word_tokenize(g) for g in cg_generated_text_subset]) + + cg_self_bleu = nltk.translate.bleu_score.corpus_bleu(list_of_references=[[nltk.word_tokenize(t) for t in cg_generated_text_subset[:i]+cg_generated_text_subset[i+1:]] + for i in range(len(cg_generated_text_subset))], + hypotheses=[nltk.word_tokenize(g) for g in cg_generated_text_subset]) + + metrics['rec_bleu'] = rec_bleu + metrics['at_bleu'] = at_bleu + metrics['cg_bleu'] = cg_bleu + metrics['cg_self_bleu'] = cg_self_bleu + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + writer = open(output_eval_file, "w") + logger.info("***** Eval results, global steps: {} *****".format(prefix)) + for key, value in metrics.items(): + logger.info(" %s = %s", key, str(value)) + writer.write("%s = %s\n" % (key, str(value))) + + return metrics + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--output_dir", default='results_cara', type=str, help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--soft_temperature", type=float, default=0.5) + parser.add_argument("--top_k", type=int, default=5) + parser.add_argument("--top_p", type=float, default=0.0) + parser.add_argument("--num_train_epochs", default=10.0, type=float, help="Total number of training epochs to perform.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--lambda", default=0, type=float, help="") + + ## Data parameters + parser.add_argument("--dataset", default='yelp', type=str, help="The dataset.") + # parser.add_argument("--train_data_file", default='../../../data/yelp/sentiment.train.tiny.text', type=str, help="The input training data file (a text file).") + parser.add_argument("--train_data_file", default='../../../data/yelp/sentiment.train.text', type=str, help="The input training data file (a text file).") + # parser.add_argument("--eval_data_file", default='../../../data/yelp/sentiment.dev.tiny.text', type=str, help="") + parser.add_argument("--eval_data_file", default='../../../data/yelp/sentiment.dev.small.text', type=str, help="2000 samples.") + parser.add_argument("--ExpName", default="local_lctrlg_yelp", type=str, help="The experiment name used in Azure Table.") + parser.add_argument("--create_new", default=0, type=int, help="") + + # Training parameters + parser.add_argument("--checkpoint_dir", default='results_arae/checkpoint-47501/pytorch_model.bin', type=str, help='results/checkpoint-1212/pytorch_model.bin') + # parser.add_argument("--checkpoint", default='', type=str, help='results/checkpoint-1212/pytorch_model.bin') + parser.add_argument("--start_global_step", default=1001, type=int, help='') + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, help="Number of updates steps to 
accumulate before performing a backward/update pass.") + parser.add_argument("--evaluate_during_training", action='store_true', help="Run evaluation during training at each logging step.") + parser.add_argument('--gloabl_step_eval', type=int, default=0, help="Evaluate the results at the given global step") + # parser.add_argument('--logging_steps', type=int, default=2000, help="ARAE") + parser.add_argument('--logging_steps', type=int, default=10, help="CARA") + parser.add_argument('--eval_steps', type=int, default=500, help="CARA") + # parser.add_argument('--save_steps', type=int, default=5000, help="ARAE") + parser.add_argument('--save_steps', type=int, default=1000, help="CARA") + parser.add_argument("--eval_all_checkpoints", action='store_true', help="") + + ## Encoder options + # parser.add_argument("--encoder_model_name_or_path", default="bert-base-uncased", type=str, ) + parser.add_argument("--encoder_model_name_or_path", default="results_cara/checkpoint-encoder-1000", type=str) + # parser.add_argument("--encoder_model_name_or_path", default="results/checkpoint-encoder-55000", type=str") + parser.add_argument("--encoder_config_name", default="", type=str, help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, help="Keep empty. Will default to decoder_model_name_or_path") + parser.add_argument("--encoder_model_type", default="bert", type=str, help="The encoder model architecture to be fine-tuned.") + + ## Decoder options + # parser.add_argument("--decoder_model_name_or_path", default="gpt2", type=str) + parser.add_argument("--decoder_model_name_or_path", default="results_cara/checkpoint-decoder-1000", type=str) + # parser.add_argument("--decoder_model_name_or_path", default="results/checkpoint-decoder-55000", type=str) + parser.add_argument("--decoder_config_name", default="", type=str, help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, help="Keep empty. 
Will default to decoder_model_name_or_path") + parser.add_argument("--decoder_model_type", default="gpt2", type=str, help="The decoder model architecture to be fine-tuned.") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--cache_dir", default="", type=str, help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=21, type=int, help="21 for Yelp and Yahoo on label-conditional text generation") + parser.add_argument("--do_lower_case", action='store_true', help="Set this flag if you are using an uncased model.") + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.5, type=float, help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=1, type=int, help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, help="dim_target_kl free bit training mode.") + parser.add_argument("--learning_rate", default=5e-6, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--use_philly", action='store_true', help="Use Philly for computing.") + parser.add_argument("--use_pretrained_model", action='store_true', + help="Use pre-trained auto-encoder models as the initialization") + parser.add_argument("--use_pretrained_vae", action='store_true', + help="Use use_pretrained_vae as initialization, where beta value is specified in the folder") + + parser.add_argument("--beta", type=float, default=1.0, help="The weighting hyper-parameter of the KL term in VAE") + parser.add_argument("--beta_cls", type=float, default=1.0, help="The weighting hyper-parameter for the classifier on the generated sentences") + + ## IO: Logging and Saving + parser.add_argument("--no_cuda", action='store_true', help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', type=int, default=1, help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, help="random seed for initialization") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', help="") + parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + + # New parameters + parser.add_argument('--label_size', type=int, default=2, help="This depends on which dataset is used.") + args = parser.parse_args() + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file or remove the --do_eval argument.") + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + logger.info("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + # pdb.set_trace() + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + \ + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + + + + if args.use_pretrained_model: + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + global_step = args.gloabl_step_eval + + if args.use_pretrained_vae: + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}-1.0'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}-1.0'.format(global_step)) + else: + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = 
decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + else: + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder = encoder_model_class(config=encoder_config, latent_size=args.latent_size) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + setattr(decoder_config, "latent_size", args.latent_size) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + # model_decoder = decoder_model_class(config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + logger.info('We have added {} tokens to GPT2'.format(num_added_toks)) + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
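+        # GPT-2 ships without a pad token, so one is registered above via
+        # add_special_tokens and the decoder's embedding matrix is resized to the new
+        # vocabulary size; the newly added rows are freshly initialized and learned
+        # during fine-tuning. The assert below guards that the decoder tokenizer really
+        # ended up with the expected pad token.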
+ assert tokenizer_decoder.pad_token == '' + + + # on_gpu = next(model_vae.parameters()).is_cuda + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + logger.info("Training/evaluation parameters %s", args) + + if not os.path.exists(args.output_dir): os.makedirs(args.output_dir) + # Training + + logff = open(os.path.join(args.output_dir, 'log_{}'.format(get_time_str())), 'a') + + if args.do_train: + global_step = args.start_global_step + model_vae = CARA(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) + + # if args.checkpoint: + # logger.info("Loading checkpoint from {}".format(args.checkpoint)) + # model_vae.load_state_dict(torch.load(args.checkpoint)) + + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + if args.local_rank == 0: + torch.distributed.barrier() + + train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + # logger.info("Test evaluate before training.") + # evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=0, subset='test') + + # Train + global_step, tr_loss = train(args, train_dataset, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, logff=logff) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_dir) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + model_to_save = model_vae.module if hasattr(model_vae, "module") else model_vae + + # Good practice: save your training arguments together with the trained model + if args.use_philly: + save_solid = False + while not save_solid: + try: + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + torch.save(model_to_save.state_dict(), os.path.join(output_dir, 'pytorch_model.bin')) + save_solid = True + except: + pass + else: + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + torch.save(model_to_save.state_dict(), os.path.join(output_dir, 'pytorch_model.bin')) + args.checkpoint = os.path.join(output_dir, 'pytorch_model.bin') + + if args.use_philly: + 
save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + # model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + # model_encoder.to(args.device) + # + # # Load a trained model and vocabulary that you have fine-tuned + # model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + # model_decoder.to(args.device) + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + # if global_step == 0: + # global_step = args.gloabl_step_eval + + # output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + # output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + # checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + + # logger.info("Evaluate the following checkpoints: %s", checkpoints) + # for checkpoint in checkpoints: + + # global_step = args.checkpoint_dir.split('/')[-2].split('-')[-1] if args.checkpoint_dir else "" + + # model_encoder = encoder_model_class.from_pretrained(checkpoint[0], latent_size=args.latent_size) + # model_encoder.to(args.device) + # model_decoder = decoder_model_class.from_pretrained(checkpoint[1], latent_size=args.latent_size) + # model_decoder.to(args.device) + + model_vae = CARA(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) + + if args.gloabl_step_eval < 1: + args.gloabl_step_eval = global_step + args.checkpoint_dir = os.path.join(args.output_dir, 'checkpoint-{}/pytorch_model.bin'.format(args.gloabl_step_eval)) + else: + global_step = args.gloabl_step_eval + args.checkpoint_dir = os.path.join(args.checkpoint_dir, 'checkpoint-{}/pytorch_model.bin'.format(args.gloabl_step_eval)) + + + # if args.checkpoint_dir and os.path.exists(args.checkpoint_dir): + # logger.info("Loading checkpoint from {}".format(args.checkpoint_dir)) + # model_vae.load_state_dict(torch.load(args.checkpoint_dir)) + # else: + # raise ValueError("Cannot find checkpoint at: {}".format(args.checkpoint)) + + metrics = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test') + metrics = dict((k + '_{}'.format(global_step), v) for k, v in metrics.items()) + results.update(metrics) + + # result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train') + # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + # results.update(result) + + return results + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/run_lm_vae_pretraining.py b/Optimus/code/examples/big_ae/run_lm_vae_pretraining.py new file 
mode 100755 index 0000000000000000000000000000000000000000..64e07e48ed1d6bb53da35709b36cbff306604986 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_pretraining.py @@ -0,0 +1,669 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random +from pathlib import Path + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear) + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + dataloader = BucketingMultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def 
mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + + + files = Path(args.train_data_file) + num_files = len(list(files.glob('*seq64*.json'))) + + + # Train! 
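+    # The pre-training corpus is sharded into multiple JSON files whose names contain
+    # 'seq64' (the 64-token sequence shards); num_files counts those shards, and the
+    # epoch loop below walks them one at a time through the bucketing dataloader,
+    # resetting it at the start of every epoch. The "total train batch size" logged
+    # next is
+    #   per_gpu_train_batch_size * n_gpu * gradient_accumulation_steps * world_size,
+    # e.g. (illustrative numbers only) 8 per GPU * 4 GPUs * 2 accumulation steps = 64
+    # examples per optimizer step on a single node.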
+ logger.info("***** Running training *****") + logger.info(" Num files = %d", num_files) + logger.info(" Num examples of first file = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + model_vae.zero_grad() + num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + tmp_list = [] + dict_token_length = defaultdict(int) + + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in num_train_epochs_iterator: + train_dataloader.reset() + for idx_file in range(num_files-1): + logger.info(f"Epoch {epoch}, File idx {train_dataloader.file_idx}") + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + + dict_token_length[ tokenized_text_lengths[0,0].item() ] += 1 + + # continue + + + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + + + + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_vae.train() + + beta_t = 0.0 # beta_t_list[step + epoch*len(epoch_iterator)] + model_vae.module.args.beta = beta_t + + if beta_t == 0.0: + model_vae.module.args.fb_mode = 0 + else: + model_vae.module.args.fb_mode = 1 + + if args.use_deterministic_connect: + model_vae.module.args.fb_mode = 2 + + loss_rec, loss_kl, loss = model_vae(inputs, labels) + + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss_rec)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; file:{idx_file}; loss: {loss.item():.3f}; ' + f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; ' + f'beta: {model_vae.module.args.beta:.3f}' + ) + ) + + # if global_step % 5 == 0: + # row = { + # 'PartitionKey': 'MILU_Rule_Rule_Template', + # 'RowKey': str(datetime.now()), + # 'ExpName' : args.ExpName, + # 'iter': str( step + epoch*len(epoch_iterator) ), + # 'loss': str( loss.item()), + # 'loss_rec': str(loss_rec.item()), + # 'loss_kl': str(loss_kl.item()), + # 'beta': str(model_vae.args.beta) + # } + # # pdb.set_trace() + # ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if 
args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model_vae.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + + # print(dict_token_length) + # with open('wikipedia_stats.json', 'w') as fp: + # json.dump(dict_token_length, fp) + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def main(): + parser = argparse.ArgumentParser() + 
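+ # Example invocation (illustrative only; the paths and hyper-parameter values below are
+ # placeholders, not the settings used for the released checkpoints):
+ #   python run_lm_vae_pretraining.py \
+ #     --do_train \
+ #     --train_data_file <path_to_wikipedia_json_shards> \
+ #     --output_dir <output_dir> \
+ #     --dataset wikipedia \
+ #     --encoder_model_type bert --encoder_model_name_or_path bert-base-cased \
+ #     --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 \
+ #     --latent_size 32 --beta 1.0 --per_gpu_train_batch_size 4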
+ ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + model_decoder = 
decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + # model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + # on_gpu = next(model_vae.parameters()).is_cuda + + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_encoder_args.bin')) + + + # Load a trained model and vocabulary that you have fine-tuned + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + model_encoder.to(args.device) + + # Load a trained model and vocabulary that you have fine-tuned + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + model_decoder.to(args.device) + + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_vae_pretraining_distributed.py b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_distributed.py new file mode 100755 index 0000000000000000000000000000000000000000..dcc199521539dc1e548c026289de66bfe67e69a8 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_distributed.py @@ -0,0 +1,678 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random +from pathlib import Path + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear) + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = TableService(account_name=storage_name, account_key=key) + + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + dataloader = BucketingMultipleFiles_DataLoader(file_path, args.train_batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + n_gpu = torch.cuda.device_count() + args.train_batch_size = args.per_gpu_train_batch_size * max(1, n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + + files = Path(args.train_data_file) + num_files = len(list(files.glob('*seq64*.json'))) + + + # Train! 
+ logger.info("***** Running training *****") + logger.info(" Num files = %d", num_files) + + n_gpu = torch.cuda.device_count() + logger.info(" Num examples of first file = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Num GPUs = %d", n_gpu) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.per_gpu_train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + model_vae.zero_grad() + num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter_per_file = len(train_dataloader) / n_gpu + n_iter = int(args.num_train_epochs * n_iter_per_file * num_files ) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=10, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + beta_t = 0.0 + + tmp_list = [] + # dict_token_length = defaultdict(int) + + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in num_train_epochs_iterator: + train_dataloader.reset() + + for idx_file in range(num_files-1): + logger.info(f"Epoch {epoch}, File idx {train_dataloader.file_idx}") + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + + # dict_token_length[ tokenized_text_lengths[0,0].item() ] += 1 + + if (tokenized_text0>len(encoder_tokenizer)-1).sum().item()>0.0 or (tokenized_text0<0).sum().item()>0.0 or (tokenized_text1>len(decoder_tokenizer)-1).sum().item()>0.0 or (tokenized_text1<0).sum().item()>0.0: + # pdb.set_trace() + logger.info(f"BERT tokens: {tokenized_text0}") + logger.info(f"GPT2 tokens: {tokenized_text1}") + continue + + + # continue + + # prepare input-output data for reconstruction + inputs, labels = tokenized_text0.to(args.device), tokenized_text1.to(args.device) + + model_vae.train() + + if args.use_beta_schedule: + try: + beta_t = beta_t_list[ step + idx_file* n_iter_per_file ] + except: + beta_t = 0.0 + + model_vae.module.args.beta = beta_t + + if beta_t == 0.0: + model_vae.module.args.fb_mode = 0 + else: + model_vae.module.args.fb_mode = 1 + + # save the mini-batch with bugs + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + torch.save(batch, os.path.join(args.output_dir, f'batch_debug_{step}.pt')) + + loss_rec, loss_kl, loss = model_vae(inputs, labels) + + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + if args.local_rank in [-1, 0]: + print("PROGRESS: {}%".format(round(100 * (step + idx_file * n_iter_per_file ) / n_iter , 4))) + print("EVALERR: {}%".format(loss_rec)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; file:{idx_file}; loss: {loss.item():.3f}; ' + f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; ' + f'beta: {model_vae.module.args.beta:.3f}' + ) + ) + # if global_step % 5 == 0: + # row = { + # 'PartitionKey': 
'MILU_Rule_Rule_Template', + # 'RowKey': str(datetime.now()), + # 'ExpName' : args.ExpName, + # 'iter': str( step + epoch*len(epoch_iterator) ), + # 'loss': str( loss.item()), + # 'loss_rec': str(loss_rec.item()), + # 'loss_kl': str(loss_kl.item()), + # 'beta': str(model_vae.args.beta) + # } + # # pdb.set_trace() + # ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model_vae.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + + if args.max_steps > 0 and global_step > args.max_steps: + 
epoch_iterator.close() + break + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + + # print(dict_token_length) + # with open('wikipedia_stats.json', 'w') as fp: + # json.dump(dict_token_length, fp) + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_beta_schedule", action='store_true', + help="Use cyclical beta schedule for auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. 
The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.5, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=1.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument("--use_distributed_training", action='store_true', + help="Use distributed training for computing.") + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + logger.info(f'Local rank is {args.local_rank}') + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, 
tokenizer_decoder.max_len_single_sentence) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + # model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + # on_gpu = next(model_vae.parameters()).is_cuda + + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_encoder_args.bin')) + + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist.py b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist.py new file mode 100755 index 0000000000000000000000000000000000000000..7d9c64dc34a9f940b9e8ac9ce68e49eae960ad91 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist.py @@ -0,0 +1,790 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os, sys +import pickle +import random +from pathlib import Path +import os.path as op +import time, json +from io import open +import re + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict +import subprocess + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + +try: + this_file = __file__ +except NameError: + this_file = sys.argv[0] +this_file = op.abspath(this_file) +print('current path: {}'.format(os.path.abspath(__file__))) +print('current folder: {}'.format(op.dirname(this_file))) +sys.path.insert(0, op.join(op.dirname(this_file), '../..')) + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear) + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + +# ts = TableService(account_name=storage_name, account_key=key) + + +def ompi_rank(): + """Find OMPI world rank without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_RANK') or 0) + +def ompi_size(): + """Find OMPI world size without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_SIZE') or 1) + +def ompi_local_rank(): + """Find OMPI local rank without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK') or 0) + +def ompi_local_size(): + """Find OMPI local size without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_LOCAL_SIZE') or 1) + +def get_master_machine(): + mpi_host_file = op.expanduser('~/mpi-hosts') + with open(mpi_host_file, 'r') as f: + master_name = f.readline().strip() + return master_name + +def get_master_ip(master_name=None): + if master_name is None: + master_name = get_master_machine() + #etc_host_file = '/etc/hosts' + etc_host_file = op.expanduser('~/etc-hosts') + with open(etc_host_file, 'r') as f: + name_ip_pairs = f.readlines() + name2ip = {} + for name_ip_pair in name_ip_pairs: + pair_list = name_ip_pair.split(' ') + key = pair_list[1].strip() + value = pair_list[0] + name2ip[key] = value + return name2ip[master_name] + +def get_gpus_nocache(): + """List of NVIDIA GPUs + """ + cmds = 'nvidia-smi --query-gpu=name --format=csv,noheader'.split(' ') + + p = subprocess.Popen(cmds, stdout=subprocess.PIPE) + ret = p.communicate() + gpus_str = 
ret[0].decode("utf-8") + gpus_arr = [gpu.strip() for gpu in gpus_str.strip().split('\n')] + return gpus_arr + +_GPUS = get_gpus_nocache() +print('_GPUs: {}'.format(_GPUS)) + +def get_gpus(): + """List of NVIDIA GPUs + """ + return _GPUS + +def gpu_indices(divisible=True): + """Get the GPU device indices for this process/rank + :param divisible: if GPU count of all ranks must be the same + :rtype: list[int] + """ + local_size = ompi_local_size() + local_rank = ompi_local_rank() + assert 0 <= local_rank < local_size, "Invalid local_rank: {} local_size: {}".format(local_rank, local_size) + gpu_count = len(get_gpus()) + assert gpu_count >= local_size > 0, "GPU count: {} must be >= LOCAL_SIZE: {} > 0".format(gpu_count, local_size) + if divisible: + ngpu = gpu_count / local_size + gpus = np.arange(local_rank * ngpu, (local_rank + 1) * ngpu) + if gpu_count % local_size != 0: + logger.warning("gpu_count: {} not divisible by local_size: {}; some GPUs may be unused".format(gpu_count, local_size)) + else: + gpus = np.array_split(range(gpu_count), local_size)[local_rank] + + ret_gpus = [int(g) for g in gpus] + return ret_gpus + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + dataloader = BucketingMultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + #gpus = list(gpu_indices()) + + if args.local_rank in [-1, 0]: tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps #* args.num_train_epochs + + if args.distributed: + t_total = t_total // ompi_size() + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + #if args.n_gpu > 1: + # model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + #if args.local_rank != -1: + #model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus, output_device=args.local_rank, find_unused_parameters=True) + #model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus) + + + files = Path(args.train_data_file) + num_files = len(list(files.glob('*seq64*.json'))) + + + # Train! 
+    logger.info("***** Running training *****")
+    logger.info("  Num files = %d", num_files)
+    logger.info("  Num examples of first file = %d", train_dataloader.num_examples)
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
+                   args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+
+
+    global_step = 0
+    tr_loss, logging_loss = 0.0, 0.0
+
+    model_vae.zero_grad()
+    num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch") #, disable=args.local_rank not in [-1, 0])
+
+    #n_iter = int(args.num_train_epochs) * len(train_dataloader)
+    #beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero)
+    n_iter_per_file = len(train_dataloader) / args.per_gpu_train_batch_size
+    n_iter = int(args.num_train_epochs * n_iter_per_file * num_files)
+    beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=10, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero)
+    beta_t = 0.0
+
+    tmp_list = []
+    dict_token_length = defaultdict(int)
+
+    set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
+    for epoch in range(int(args.num_train_epochs)): # num_train_epochs_iterator:
+        train_dataloader.reset()
+        for idx_file in range(num_files-1):
+            logger.info(f"Rank {ompi_rank()}, Epoch {epoch}, File idx {train_dataloader.file_idx}")
+            #epoch_iterator = tqdm(train_dataloader, desc="Iteration") #disable=disable=args.local_rank not in [-1, 0])
+            for step, batch in enumerate(train_dataloader):
+                tokenized_text0, tokenized_text1, tokenized_text_lengths = batch
+
+                #dict_token_length[tokenized_text_lengths[0,0].item()] += 1
+
+                # continue
+                # tokenized_text0 = tokenized_text0.to(args.device)
+                # tokenized_text1 = tokenized_text1.to(args.device)
+                # prepare input-output data for reconstruction
+
+                inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1)
+                labels = tokenized_text1
+
+                tokenized_text1 = tokenized_text1.to(args.device)
+                inputs = inputs.to(args.device)
+                labels = labels.to(args.device)
+
+                model_vae.train()
+
+                if args.use_beta_schedule:
+                    if global_step >= len(beta_t_list):
+                        beta_t = 1.0
+                    else:
+                        beta_t = beta_t_list[global_step]
+
+                #try:
+                #    beta_t = beta_t_list[global_step] #[step + idx_file* n_iter_per_file]
+                #except:
+                #    beta_t = 0.0
+
+                #beta_t = 0.0 # beta_t_list[step + epoch*len(epoch_iterator)]
+                model_vae.module.args.beta = beta_t
+
+                if beta_t == 0.0:
+                    model_vae.module.args.fb_mode = 0
+                else:
+                    model_vae.module.args.fb_mode = 1
+
+                if args.use_deterministic_connect:
+                    model_vae.module.args.fb_mode = 2
+
+                loss_rec, loss_kl, loss = model_vae(inputs, labels)
+
+                loss_rec = loss_rec.mean()  # mean() to average on multi-gpu parallel training
+                loss_kl = loss_kl.mean()
+                loss = loss.mean()
+
+                if args.use_philly:
+                    #if args.local_rank in [-1, 0]:
+                    if args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                        logger.info("Steps {}, Rank {}, File {}, Epoch: [{}/{}][{}/{}], Beta: {}, Loss: {}".format(global_step, ompi_rank(), train_dataloader.file_idx,
+                                    epoch,
args.num_train_epochs, step, len(train_dataloader), model_vae.module.args.beta, loss_rec)) + logger.info("PROGRESS: {}%".format(round(100*(step + epoch*len(train_dataloader))/(int(args.num_train_epochs) * len(train_dataloader)), 4))) + logger.info("EVALERR: {}%".format(loss_rec)) + + #epoch_iterator.set_description( + # ( + # f'rank: {ompi_rank()}; ' + # f'iter: {step + epoch*len(epoch_iterator) }; file:{idx_file}; loss: {loss.item():.3f}; ' + # f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; ' + # f'beta: {model_vae.module.args.beta:.3f}' + # ) + #) + # if global_step % 5 == 0: + # row = { + # 'PartitionKey': 'MILU_Rule_Rule_Template', + # 'RowKey': str(datetime.now()), + # 'ExpName' : args.ExpName, + # 'iter': str( step + epoch*len(epoch_iterator) ), + # 'loss': str( loss.item()), + # 'loss_rec': str(loss_rec.item()), + # 'loss_kl': str(loss_kl.item()), + # 'beta': str(model_vae.args.beta) + # } + # # pdb.set_trace() + # ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + scheduler.step() # Update learning rate schedule + model_vae.zero_grad() + + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}-{}'.format(global_step, model_vae.module.args.beta)) + + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}-{}'.format(global_step, model_vae.module.args.beta)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model_vae.module.decoder if 
hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + if args.max_steps > 0 and global_step > args.max_steps: + #epoch_iterator.close() + break + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + + # print(dict_token_length) + # with open('wikipedia_stats.json', 'w') as fp: + # json.dump(dict_token_length, fp) + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_beta_schedule", action='store_true', help="Use cyclical beta schedule for auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + 
help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + + parser.add_argument('--world-size', default=ompi_size(), type=int, help='number of distributed processes') + parser.add_argument('--dist-url', default='tcp://' + get_master_ip() + ':23456', type=str, + help='url used to set up distributed training') + parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend') + parser.add_argument('--port', type=str, default='51115', help="Port") + + args = parser.parse_args() + + args.dist_url = 'tcp://' + get_master_ip() + ':' + args.port + + # Setup logging + logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.INFO) + logger = logging.getLogger(__name__) + + rank_node = ompi_rank() + args.distributed = args.world_size > 1 + logger.info("Rank {} distributed: {}".format(rank_node, args.distributed)) + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + if args.distributed: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.distributed.init_process_group( + backend=args.dist_backend, + init_method=args.dist_url, + world_size=args.world_size, + rank=ompi_rank(), + group_name='mtorch') + logger.info("World Size is {}, Backend is {}, Init Method is {}, rank is {}".format(args.world_size, args.dist_backend, args.dist_url, ompi_rank())) + + gpus = list(gpu_indices()) + args.n_gpu = len(gpus) + args.local_rank = ompi_rank() #gpus[0] + torch.cuda.set_device(gpus[0]) + device = torch.device("cuda", gpus[0]) + + args.device = device + logger.info('Rank {}, gpus: {}, get_rank: {}'.format(rank_node, gpus, torch.distributed.get_rank())) + logger.info(f'Local rank is {args.local_rank}, {rank_node}') + + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + if ompi_rank() == 0: + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + #if args.local_rank not in [-1, 0]: torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size 
= min(args.block_size, tokenizer_decoder.max_len_single_sentence) + setattr(decoder_config, "latent_size", args.latent_size) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + #model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + #model_vae.cuda() + + # Distributed training (should be after apex fp16 initialization) + if args.distributed: + # model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus, output_device=args.local_rank, find_unused_parameters=True) + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus) + elif args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae)#.to(args.device) + + # on_gpu = next(model_vae.parameters()).is_cuda + + #if args.local_rank == 0: torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step=0 + if args.do_train: + #if args.local_rank not in [-1, 0]: torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + #if args.local_rank == 0: torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info("Rank %d, global_step = %s, average loss = %s", ompi_rank(), global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+    # They can then be reloaded using `from_pretrained()`
+
+    model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder  # Take care of distributed/parallel training
+    model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder  # Take care of distributed/parallel training
+
+    # Good practice: save your training arguments together with the trained model
+    if args.use_philly:
+        save_solid = False
+        while not save_solid:
+            try:
+                model_encoder_to_save.save_pretrained(output_encoder_dir)
+                torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
+                save_solid = True
+            except:
+                pass
+    else:
+        model_encoder_to_save.save_pretrained(output_encoder_dir)
+        torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
+
+
+    if args.use_philly:
+        save_solid = False
+        while not save_solid:
+            try:
+                model_decoder_to_save.save_pretrained(output_decoder_dir)
+                torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
+                save_solid = True
+            except:
+                pass
+    else:
+        model_decoder_to_save.save_pretrained(output_decoder_dir)
+        torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
+
+
+
+if __name__ == "__main__":
+    main()
diff --git a/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py
new file mode 100755
index 0000000000000000000000000000000000000000..0f5abee50aa87dec91d9d5769fb097185ad85e25
--- /dev/null
+++ b/Optimus/code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py
@@ -0,0 +1,771 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
+using a masked language modeling (MLM) loss.
+""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os, sys +import pickle +import random +from pathlib import Path +import os.path as op +import time, json +from io import open +import re + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict +import subprocess + +import torch.nn.init as init + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + +try: + this_file = __file__ +except NameError: + this_file = sys.argv[0] +this_file = op.abspath(this_file) +print('current path: {}'.format(os.path.abspath(__file__))) +print('current folder: {}'.format(op.dirname(this_file))) +sys.path.insert(0, op.join(op.dirname(this_file), '../..')) + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, BucketingDataLoader, BucketingMultipleFiles_DataLoader, frange_cycle_linear, frange_cycle_zero_linear) + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + +# ts = TableService(account_name=storage_name, account_key=key) + + +def ompi_rank(): + """Find OMPI world rank without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_RANK') or 0) + +def ompi_size(): + """Find OMPI world size without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_SIZE') or 1) + +def ompi_local_rank(): + """Find OMPI local rank without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK') or 0) + +def ompi_local_size(): + """Find OMPI local size without calling mpi functions + :rtype: int + """ + return int(os.environ.get('OMPI_COMM_WORLD_LOCAL_SIZE') or 1) + +def get_master_machine(): + mpi_host_file = op.expanduser('~/mpi-hosts') + with open(mpi_host_file, 'r') as f: + master_name = f.readline().strip() + return master_name + +def get_master_ip(master_name=None): + if master_name is None: + master_name = get_master_machine() + #etc_host_file = '/etc/hosts' + etc_host_file = op.expanduser('~/etc-hosts') + with open(etc_host_file, 'r') as f: + name_ip_pairs = f.readlines() + name2ip = {} + for name_ip_pair in name_ip_pairs: + pair_list = name_ip_pair.split(' ') + key = pair_list[1].strip() + value = pair_list[0] + name2ip[key] = value + return name2ip[master_name] + +def get_gpus_nocache(): + """List of NVIDIA GPUs + """ + cmds = 'nvidia-smi --query-gpu=name --format=csv,noheader'.split(' ') + + p = subprocess.Popen(cmds, stdout=subprocess.PIPE) + ret = p.communicate() 
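+    # communicate() returns a (stdout, stderr) tuple of byte strings; stdout is decoded
+    # below into one GPU name per line.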
+ gpus_str = ret[0].decode("utf-8") + gpus_arr = [gpu.strip() for gpu in gpus_str.strip().split('\n')] + return gpus_arr + +_GPUS = get_gpus_nocache() +print('_GPUs: {}'.format(_GPUS)) + +def get_gpus(): + """List of NVIDIA GPUs + """ + return _GPUS + +def gpu_indices(divisible=True): + """Get the GPU device indices for this process/rank + :param divisible: if GPU count of all ranks must be the same + :rtype: list[int] + """ + local_size = ompi_local_size() + local_rank = ompi_local_rank() + assert 0 <= local_rank < local_size, "Invalid local_rank: {} local_size: {}".format(local_rank, local_size) + gpu_count = len(get_gpus()) + assert gpu_count >= local_size > 0, "GPU count: {} must be >= LOCAL_SIZE: {} > 0".format(gpu_count, local_size) + if divisible: + ngpu = gpu_count / local_size + gpus = np.arange(local_rank * ngpu, (local_rank + 1) * ngpu) + if gpu_count % local_size != 0: + logger.warning("gpu_count: {} not divisible by local_size: {}; some GPUs may be unused".format(gpu_count, local_size)) + else: + gpus = np.array_split(range(gpu_count), local_size)[local_rank] + + ret_gpus = [int(g) for g in gpus] + return ret_gpus + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + dataloader = BucketingMultipleFiles_DataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=True) + else: + pass + return dataloader + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def weights_init_rondom(model): + model = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_state_dict = model.state_dict() + for key in model_state_dict: + init.normal_(model_state_dict[key].data) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + + +def save_checkpoint(model_vae, optimizer, global_step, args): + + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+    # They can then be reloaded using `from_pretrained()`
+
+    model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder  # Take care of distributed/parallel training
+    model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder  # Take care of distributed/parallel training
+
+    # Good practice: save your training arguments together with the trained model
+    if args.use_philly:
+        save_solid = False
+        while not save_solid:
+            try:
+                model_encoder_to_save.save_pretrained(output_encoder_dir)
+                torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
+                save_solid = True
+            except:
+                pass
+    else:
+        model_encoder_to_save.save_pretrained(output_encoder_dir)
+        torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin'))
+
+
+    if args.use_philly:
+        save_solid = False
+        while not save_solid:
+            try:
+                model_decoder_to_save.save_pretrained(output_decoder_dir)
+                torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
+                save_solid = True
+            except:
+                pass
+    else:
+        model_decoder_to_save.save_pretrained(output_decoder_dir)
+        torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin'))
+
+
+    # save the full model and optimizer into a checkpoint
+    model_to_save = model_vae.module if hasattr(model_vae, 'module') else model_vae  # Take care of distributed/parallel training
+
+    checkpoint = {
+        'iter': global_step,
+        'model_state_dict': model_to_save.state_dict(),
+        'optimizer_state_dict': optimizer.state_dict(),
+        'beta': model_to_save.args.beta,
+        'args': args
+    }
+
+    output_full_dir = os.path.join(args.output_dir, 'checkpoint-full-{}'.format(global_step))
+    if not os.path.exists(output_full_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(output_full_dir)
+
+    logger.info("Start saving full model checkpoint to %s", output_full_dir)
+    if args.use_philly:
+        save_solid = False
+        n_save_attempts = 0
+        while not save_solid:
+            try:
+                n_save_attempts += 1
+                logger.info(f"Saving full checkpoint: {n_save_attempts} attempts made")
+                torch.save(checkpoint, os.path.join(output_full_dir, 'training.bin'))
+                logger.info("Saving full checkpoint to %s", output_full_dir)
+                save_solid = True
+            except:
+                pass
+    else:
+        torch.save(checkpoint, os.path.join(output_full_dir, 'training.bin'))
+        logger.info("Saving full checkpoint to %s", output_full_dir)
+
+
+def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name):
+    """ Train the model """
+    #gpus = list(gpu_indices())
+
+    if args.local_rank in [-1, 0]: tb_writer = SummaryWriter()
+
+
+    args.n_gpu = (torch.distributed.get_world_size() if args.local_rank != -1 else 1)
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps #* args.num_train_epochs
+
+    if args.distributed:
+        t_total = t_total // ompi_size()
+
+    # Prepare optimizer and schedule (linear warmup and decay)
+
+
+    # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear
+    no_decay = ['bias',
'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + #if args.n_gpu > 1: + # model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + #if args.local_rank != -1: + #model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus, output_device=args.local_rank, find_unused_parameters=True) + #model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus) + + + files = Path(args.train_data_file) + num_files = len(list(files.glob('*seq64*.json'))) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num files = %d", num_files) + logger.info(" Num examples of first file = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + model_vae.zero_grad() + num_train_epochs_iterator = trange(int(args.num_train_epochs), desc="Epoch") #, disable=args.local_rank not in [-1, 0]) + + #n_iter = int(args.num_train_epochs) * len(train_dataloader) + n_iter_per_file = train_dataloader.num_examples / args.train_batch_size + n_iter = int(args.num_train_epochs * n_iter_per_file * num_files) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=10, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + logger.info(f"Total iters (estimated): {n_iter}; Length of beta schedule: {len(beta_t_list)}; #Iter per file {n_iter_per_file}") + + beta_t = 0.0 + tmp_list = [] + dict_token_length = defaultdict(int) + + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in range(int(args.num_train_epochs)): # num_train_epochs_iterator: + train_dataloader.reset() + for idx_file in range(num_files-1): + + logger.info(f"Rank {ompi_rank()}, Epoch {epoch}, File idx {train_dataloader.file_idx}") + #epoch_iterator = tqdm(train_dataloader, desc="Iteration") #disable=disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(train_dataloader): + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + + #dict_token_length[tokenized_text_lengths[0,0].item()] += 1 + # continue + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = 
tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_vae.train() + + if args.use_beta_schedule: + if global_step >= len(beta_t_list): + beta_t = 1.0 + else: + beta_t = beta_t_list[global_step] + + #try: + # beta_t = beta_t_list[global_step] #[step + idx_file* n_iter_per_file] + #except: + # beta_t = 0.0 + + #beta_t = 0.0 # beta_t_list[step + epoch*len(epoch_iterator)] + model_vae.module.args.beta = beta_t + + if beta_t == 0.0: + model_vae.module.args.fb_mode = 0 + else: + model_vae.module.args.fb_mode = 1 + + if args.use_deterministic_connect: + model_vae.module.args.fb_mode = 2 + + loss_rec, loss_kl, loss = model_vae(inputs, labels) + + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + #if args.local_rank in [-1, 0]: + if args.logging_steps > 0 and global_step % args.logging_steps == 0: + logger.info("Steps {}, Rank {}, File {}, Epoch: [{}/{}][{}/{}], Beta: {}, Loss: {}".format(global_step, ompi_rank(), train_dataloader.file_idx, + epoch, args.num_train_epochs, step, n_iter_per_file, model_vae.module.args.beta, loss_rec)) + logger.info("PROGRESS: {}%".format(round(100 * global_step /n_iter, 4))) + logger.info("EVALERR: {}%".format(loss_rec)) + + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + scheduler.step() # Update learning rate schedule + model_vae.zero_grad() + + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + save_checkpoint(model_vae, optimizer, global_step, args) + + + if args.max_steps > 0 and global_step > args.max_steps: + #epoch_iterator.close() + break + + + + # print(dict_token_length) + # with open('wikipedia_stats.json', 'w') as fp: + # json.dump(dict_token_length, fp) + + return global_step, tr_loss / global_step, optimizer + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where 
the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_beta_schedule", action='store_true', help="Use cyclical beta schedule for auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument("--use_random_weight", action='store_true', + help="Use random weights as initialization") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + + parser.add_argument('--world-size', default=ompi_size(), type=int, help='number of distributed processes') + parser.add_argument('--dist-url', default='tcp://' + get_master_ip() + ':23456', type=str, + help='url used to set up distributed training') + parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend') + parser.add_argument('--port', type=str, default='51115', help="Port") + + args = parser.parse_args() + + args.dist_url = 'tcp://' + get_master_ip() + ':' + args.port + + # Setup logging + logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.INFO) + logger = logging.getLogger(__name__) + + rank_node = ompi_rank() + args.distributed = args.world_size > 1 + logger.info("Rank {} distributed: {}".format(rank_node, args.distributed)) + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + if args.distributed: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.distributed.init_process_group( + backend=args.dist_backend, + init_method=args.dist_url, + world_size=args.world_size, + rank=ompi_rank(), + group_name='mtorch') + logger.info("World Size is {}, Backend is {}, Init Method is {}, rank is {}".format(args.world_size, args.dist_backend, args.dist_url, ompi_rank())) + + gpus = list(gpu_indices()) + args.n_gpu = len(gpus) + args.local_rank = ompi_rank() #gpus[0] + torch.cuda.set_device(gpus[0]) + device = torch.device("cuda", gpus[0]) + + args.device = device + logger.info('Rank {}, gpus: {}, get_rank: {}'.format(rank_node, gpus, torch.distributed.get_rank())) + logger.info(f'Local rank is {args.local_rank}, {rank_node}') + + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + if ompi_rank() == 0: + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + #if args.local_rank not in [-1, 0]: torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size 
= min(args.block_size, tokenizer_decoder.max_len_single_sentence) + setattr(decoder_config, "latent_size", args.latent_size) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + #model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + #model_vae.cuda() + if args.use_random_weight: + model_vae.apply(weights_init_rondom) + + # Distributed training (should be after apex fp16 initialization) + if args.distributed: + # model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus, output_device=args.local_rank, find_unused_parameters=True) + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=gpus) + elif args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae)#.to(args.device) + + # on_gpu = next(model_vae.parameters()).is_cuda + + #if args.local_rank == 0: torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step=0 + if args.do_train: + #if args.local_rank not in [-1, 0]: torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + #if args.local_rank == 0: torch.distributed.barrier() + + global_step, tr_loss, optimizer = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info("Rank %d, global_step = %s, average loss = %s", ompi_rank(), global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + save_checkpoint(model_vae, optimizer, global_step, args) + + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_vae_training.py b/Optimus/code/examples/big_ae/run_lm_vae_training.py new file mode 100755 index 0000000000000000000000000000000000000000..6acd77796d185af26a583c3c577a3d10102a9710 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_training.py @@ -0,0 +1,979 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random + +import numpy as np +import torch +import torch.nn.init as init +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (weight_init, calc_iwnll, calc_rec, calc_mi, calc_au, BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers, frange_cycle_linear, frange_cycle_zero_linear) + + +from modules import VAE + + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + dataloader = BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=100, shuffle=False) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% 
random, 10% original. """
+    labels = inputs.clone()
+    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability, defaults to 0.15 in Bert/RoBERTa)
+
+    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8)
+    labels[masked_indices==1] = -1  # We only compute loss on masked tokens
+
+    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
+    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices
+    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
+
+    # 10% of the time, we replace masked input tokens with a random word
+    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced
+    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
+    inputs[indices_random] = random_words[indices_random]
+
+    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
+    return inputs, labels
+
+def weights_init_rondom(model):
+    model = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+    model_state_dict = model.state_dict()
+    for key in model_state_dict:
+        # pdb.set_trace()  # debugging leftover; do not drop into the debugger for every parameter
+        if 'encoder' in key:
+            init.normal_(model_state_dict[key].data)
+            # weight_init(item)
+
+def save_checkpoint(model_vae, optimizer, global_step, args):
+
+    # Create output directory if needed
+    # Save model checkpoint
+    output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
+    output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
+    if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(output_encoder_dir)
+    if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(output_decoder_dir)
+
+    logger.info("Saving encoder model checkpoint to %s", output_encoder_dir)
+    logger.info("Saving decoder model checkpoint to %s", output_decoder_dir)
+    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
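The `mask_tokens` helper above composes three independent Bernoulli draws: about `mlm_probability` (15%) of positions are selected, 80% of the selected positions become `[MASK]`, and the Bernoulli(0.5) draw splits the remaining selected positions evenly between "replace with a random token" and "keep unchanged", which is where the 10%/10% in the docstring comes from. Below is a minimal, self-contained sanity check of that arithmetic; the helper name and constants are illustrative only and not part of the patch.

```python
import torch

def masking_fractions(num_tokens=200_000, mlm_probability=0.15, seed=0):
    """Empirically check the 80/10/10 split used by mask_tokens()."""
    torch.manual_seed(seed)
    shape = (num_tokens,)
    selected = torch.bernoulli(torch.full(shape, mlm_probability)).bool()
    # 80% of the selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(shape, 0.8)).bool() & selected
    # half of the selected-but-not-masked positions -> random token (10% of the selected tokens)
    to_random = torch.bernoulli(torch.full(shape, 0.5)).bool() & selected & ~to_mask
    # everything else that was selected stays unchanged (the other 10%)
    unchanged = selected & ~to_mask & ~to_random
    n = float(num_tokens)
    return to_mask.sum().item() / n, to_random.sum().item() / n, unchanged.sum().item() / n

# Expected output is close to (0.12, 0.015, 0.015):
# 12% of all tokens masked, 1.5% randomized, 1.5% selected but kept.
print(masking_fractions())
```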
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_encoder_args.bin')) + + + # save the full model and optmizer into a checkpoint + model_to_save = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + checkpoint = { + 'iter': global_step, + 'model_state_dict': model_to_save.state_dict(), + 'optimizer_state_dict': optimizer.state_dict(), + 'beta': model_to_save.args.beta, + 'args': args + } + + output_full_dir = os.path.join(args.output_dir, 'checkpoint-full-{}'.format(global_step)) + if not os.path.exists(output_full_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_full_dir) + + logger.info("Start saving full model checkpoint to %s", output_full_dir) + if args.use_philly: + save_solid = False + n_save_attempts = 0 + while not save_solid: + try: + n_save_attempts += 1 + logger.info(f"Saving full checkpoint: {n_save_attempts} attempts made") + torch.save(checkpoint, os.path.join(output_full_dir, 'training.bin')) + logger.info("Saving full checkpoint to %s,", output_full_dir) + save_solid = True + except: + pass + else: + torch.save(checkpoint, os.path.join(output_full_dir, 'training.bin')) + logger.info("Saving full checkpoint to %s", output_full_dir) + + + +def load_checkpoint(args, loading_step=None): + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + if loading_step: + global_step = args.gloabl_step_eval + else: + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step)) + + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, 
do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # Load full model + checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin')) + + + + +def train(args, train_dataloader, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! 
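`train()` drives the KL weight with `frange_cycle_zero_linear`, built a few lines below from `--beta`, `--ratio_zero` and `--ratio_increase`. That utility lives in `utils.py` and is not included in this excerpt, so the sketch below is only one plausible reading consistent with the argument names: each cycle holds beta at zero for a `ratio_zero` fraction of its length (pure auto-encoding), ramps linearly to the target `beta` over a `ratio_increase` fraction, and then stays at the target.

```python
import numpy as np

def zero_linear_schedule(n_iter, stop=1.0, n_cycle=1, ratio_increase=0.25, ratio_zero=0.25):
    """Illustrative beta schedule: zero phase -> linear ramp -> constant."""
    betas = np.full(n_iter, stop, dtype=np.float64)
    period = n_iter / n_cycle
    for c in range(n_cycle):
        for i in range(int(period)):
            t = int(c * period + i)
            if t >= n_iter:
                break
            if i < period * ratio_zero:
                betas[t] = 0.0                      # pure auto-encoding stage
            elif i < period * (ratio_zero + ratio_increase):
                ramp = (i - period * ratio_zero) / (period * ratio_increase)
                betas[t] = ramp * stop              # linear annealing stage
            # else: keep the target beta
    return betas

# e.g. 100 iterations, one cycle: 25 steps at 0, 25 steps ramping up, 50 steps at the target
schedule = zero_linear_schedule(100, stop=1.0, n_cycle=1)
print(schedule[0], schedule[24], schedule[37], schedule[60])   # 0.0, 0.0, ~0.5, 1.0
```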
+ logger.info("***** Running training *****") + logger.info(" Num examples = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model_vae.zero_grad() + + # model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + tmp_list = [] + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + + # if (tokenized_text0>len(encoder_tokenizer)).sum().item()>0.0 or (tokenized_text1>len(decoder_tokenizer)).sum().item()>0.0: + # pdb.set_trace() + # continue + + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_vae.train() + + beta_t = beta_t_list[step + epoch*len(epoch_iterator)] + model_vae.module.args.beta = beta_t + + if beta_t == 0.0: + model_vae.module.args.fb_mode = 0 + else: + model_vae.module.args.fb_mode = 1 + + if args.use_deterministic_connect: + model_vae.module.args.fb_mode = 2 + + + loss_rec, loss_kl, loss = model_vae(inputs, labels) + + + # Chunyuan: loss_rec size is [4], while latent_z size is [12] + if args.n_gpu > 1: + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss_rec)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; loss: {loss.item():.3f}; ' + f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; ' + f'beta: {model_vae.module.args.beta:.3f}' + ) + ) + + # if global_step % 5 == 0: + # row = { + # 'PartitionKey': 'MILU_Rule_Rule_Template', + # 'RowKey': str(datetime.now()), + # 'ExpName' : args.ExpName, + # 'iter': str( step + epoch*len(epoch_iterator) ), + # 'loss': str( loss.item()), + # 'loss_rec': str(loss_rec.item()), + # 'loss_kl': str(loss_kl.item()), + # 'beta': str(model_vae.args.beta) + # } + # # pdb.set_trace() + # ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if 
args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model_vae.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + save_checkpoint(model_vae, optimizer, global_step, args) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step, optimizer + + +def evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + # if subset == 'test': + # eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + # elif subset == 'train': + # eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + logger.info("***** Running evaluation on {} dataset *****".format(subset)) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.per_gpu_eval_batch_size = 1 + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + + # Note that DistributedSampler samples randomly + # eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + # eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + # Eval! 
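In the loop above, `fb_mode` is switched on the VAE depending on whether the current beta is zero, and `--dim_target_kl` sets the dimension-wise threshold ("free bits") used inside the VAE's KL term. The VAE module itself is not shown in this excerpt; the snippet below only sketches the idea for a diagonal Gaussian posterior against a standard normal prior, where the KL decomposes per latent dimension and dimensions below the target are excluded from the penalty. The function names and the exact masking rule are assumptions for illustration.

```python
import torch

def kl_per_dim(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ) per latent dimension:
    # 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)

def thresholded_kl(mu, logvar, dim_target_kl=3.0):
    kl = kl_per_dim(mu, logvar)                     # shape [batch, latent_size]
    active = (kl > dim_target_kl).float()           # only penalize dimensions above the target
    return (active * kl).sum(dim=-1)                # one value per example

mu = torch.zeros(4, 32)
logvar = torch.zeros(4, 32)                         # posterior equals the prior, so KL is 0
print(thresholded_kl(mu, logvar))                   # tensor([0., 0., 0., 0.])
```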
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.eval_batch_size) + + model_vae.eval() + + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + mi = calc_mi(model_vae, eval_dataloader, args) + au = calc_au(model_vae, eval_dataloader, delta=0.01, args=args)[0] + ppl, elbo, nll, kl = calc_iwnll(model_vae, eval_dataloader, args, ns=100) + + result = { + "perplexity": ppl, "elbo": elbo, "kl": kl, "nll": nll, "au": au, "mi": mi + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results *****") + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + row = { + 'PartitionKey': 'MILU_Rule_Rule_Template', + 'RowKey': str(datetime.now()), + 'ExpName' : args.ExpName, + 'test_perplexity': str( ppl ), + 'test_elbo': str( elbo ), + 'test_nll': str(nll), + 'test_au': str(au), + 'test_mi': str(mi) + } + # pdb.set_trace() + # ts.insert_entity(table_name, row) + + + return result + + + + +def evaluate_rec(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + + if subset == 'test': + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + elif subset == 'train': + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + logger.info("***** Running evaluation on {} dataset *****".format(subset)) + + # Note that DistributedSampler samples randomly + # eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + # eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + # Eval! 
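`evaluate()` above reports perplexity, ELBO, NLL, KL, the number of active units (AU) and mutual information (MI), with `calc_iwnll` estimating the NLL from `ns=100` importance samples per sentence. Those utilities are not shown here; the only step worth spelling out is how a corpus-level NLL in nats turns into the reported perplexity, sketched below with made-up numbers.

```python
import math

# Hypothetical per-sentence NLL estimates (in nats) and sentence lengths (in tokens).
sentence_nll = [42.1, 35.7, 58.3]
sentence_len = [12, 9, 16]

total_nll = sum(sentence_nll)
total_tokens = sum(sentence_len)

nll_per_token = total_nll / total_tokens
perplexity = math.exp(nll_per_token)
print(f"NLL/token = {nll_per_token:.3f} nats, PPL = {perplexity:.2f}")
```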
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.eval_batch_size) + + model_vae.eval() + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + nll_s, nll_w = calc_rec(model_vae, eval_dataloader, args, ns=1) + + result = { + "rec_w": nll_w, "rec_s": nll_s + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info("%s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--checkpoint_dir", default=None, type=str, + help="The directory where checkpoints are saved.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + parser.add_argument("--save_bert_gpt_init", action='store_true', + help="Use Philly for computing.") + parser.add_argument("--length_weighted_loss", action='store_true', + help="Use sentence length re-weight the reconstruction loss.") + + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_pretrained_model", action='store_true', + help="Use pre-trained auto-encoder 
models as the initialization") + parser.add_argument("--latent_as_gpt_memory", default=1, type=int, help="Latent vector as memery for GPT2 to attend.") + parser.add_argument("--latent_as_gpt_emb", default=1, type=int, help="Latent vector as embeddings for GPT2.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--do_eval_rec", action='store_true', + help="Whether to run eval reconstruction on a set of models.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + parser.add_argument("--use_pretrained_vae", action='store_true', + help="Use use_pretrained_vae as initialization, where beta value is specified in the folder") + parser.add_argument("--use_random_weight", action='store_true', + help="Use random weights as initialization") + + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + # Load Optimius pre-trained model and tokenizer + if args.use_pretrained_model: + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + output_full_dir = os.path.join(args.checkpoint_dir, 'checkpoint-full-{}'.format(global_step)) + + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = 
decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + # Load full model + checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin')) + + + else: + # Load BERT and GPT weights (As an alternaive, one may train a VAE for this small) + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + + if args.latent_as_gpt_emb + args.latent_as_gpt_memory == 0: + return # latent vector should pass into GPT to decode + else: + latent_as_gpt_emb = True if args.latent_as_gpt_emb == 1 else False + latent_as_gpt_memory = True if args.latent_as_gpt_memory == 1 else False + + setattr(decoder_config, "latent_size", args.latent_size) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size, latent_as_gpt_emb=latent_as_gpt_emb, latent_as_gpt_memory=latent_as_gpt_memory) + + # Save the init weights of BERT and GPT-2, so that we can load from local (Some infra requires so) + if args.save_bert_gpt_init: + encoder_path = os.path.join(args.output_dir, f"initial-models-tokenization-enoder-{args.latent_size}") + if not os.path.exists(encoder_path): os.makedirs(encoder_path) + model_encoder.save_pretrained(encoder_path) + tokenizer_encoder.save_pretrained(encoder_path) + + decoder_path = os.path.join(args.output_dir, f"initial-models-tokenization-decoder-{args.latent_size}") + if not os.path.exists(decoder_path): os.makedirs(decoder_path) + model_decoder.save_pretrained(decoder_path) + tokenizer_decoder.save_pretrained(decoder_path) + + return 
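The `--latent_as_gpt_emb` / `--latent_as_gpt_memory` flags handled above choose how the decoder consumes the latent vector: added to every token embedding, exposed as an extra key/value memory that the attention layers can attend to, or both (the early `return` guards the degenerate case where neither route is enabled and the latent would never reach the decoder). The real wiring is inside `GPT2ForLatentConnector`, which is not part of this excerpt; the snippet below is a schematic of the two routes only, with all class and layer names hypothetical.

```python
import torch
import torch.nn as nn

class LatentInjectionSketch(nn.Module):
    """Schematic only: project z once for the embedding route and once per layer for the memory route."""
    def __init__(self, latent_size=32, hidden_size=768, n_layer=12,
                 as_embedding=True, as_memory=True):
        super().__init__()
        self.as_embedding = as_embedding
        self.as_memory = as_memory
        self.to_emb = nn.Linear(latent_size, hidden_size)            # shared offset added to every position
        self.to_mem = nn.Linear(latent_size, hidden_size * n_layer)  # one extra "memory" vector per layer
        self.n_layer, self.hidden_size = n_layer, hidden_size

    def forward(self, token_embeddings, z):
        mem = None
        if self.as_embedding:
            token_embeddings = token_embeddings + self.to_emb(z).unsqueeze(1)
        if self.as_memory:
            mem = self.to_mem(z).view(z.size(0), self.n_layer, 1, self.hidden_size)
        return token_embeddings, mem

x = torch.randn(2, 10, 768)   # [batch, seq_len, hidden]
z = torch.randn(2, 32)        # [batch, latent_size]
emb, mem = LatentInjectionSketch()(x, z)
print(emb.shape, mem.shape)   # torch.Size([2, 10, 768]) torch.Size([2, 12, 1, 768])
```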
+
+
+
+    # Chunyuan: Add Padding token to GPT2
+    special_tokens_dict = {'pad_token': '<PAD>', 'bos_token': '<BOS>', 'eos_token': '<EOS>'}
+    num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict)
+    print('We have added', num_added_toks, 'tokens to GPT2')
+    model_decoder.resize_token_embeddings(len(tokenizer_decoder))  # Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e. the length of the tokenizer.
+    assert tokenizer_decoder.pad_token == '<PAD>'
+
+    # model_decoder.to(args.device)
+
+    model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args)
+
+    # pdb.set_trace()
+    if args.use_random_weight:
+        model_vae.apply(weights_init_rondom)
+
+    if args.use_pretrained_model:
+        model_vae.load_state_dict(checkpoint['model_state_dict'])
+        logger.info("Pre-trained Optimus is successfully loaded")
+    model_vae.to(args.device)
+
+    # on_gpu = next(model_vae.parameters()).is_cuda
+
+
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # End of barrier to make sure only the first process in distributed training downloads model & vocab
+
+    logger.info("Training/evaluation parameters %s", args)
+
+    ##############################
+    # Training
+    global_step = 0
+    if args.do_train:
+        if args.local_rank not in [-1, 0]:
+            torch.distributed.barrier()  # Barrier to make sure only the first process in distributed training processes the dataset; the others will use the cache
+
+        train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False)
+
+        if args.local_rank == 0:
+            torch.distributed.barrier()
+
+        global_step, tr_loss, optimizer = train(args, train_dataloader, model_vae, tokenizer_encoder, tokenizer_decoder, table_name)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        save_checkpoint(model_vae, optimizer, global_step, args)
+
+
+    ##############################
+    # Evaluate the metrics of VAE models, including PPL, MI and AU
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        if global_step == 0:
+            global_step = args.gloabl_step_eval
+
+        output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step))
+        output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step))
+        output_full_dir = os.path.join(args.output_dir, 'checkpoint-full-{}'.format(global_step))
+        checkpoint_dir = [output_encoder_dir, output_decoder_dir, output_full_dir]
+
+        logger.info("Evaluate the following checkpoint: %s", checkpoint_dir[-1])
+        global_step = checkpoint_dir[-1].split('-')[-1] if len(checkpoint_dir) > 1 else ""
+
+        checkpoint = torch.load(os.path.join(output_full_dir, 'training.bin'))
+        model_vae.load_state_dict(checkpoint['model_state_dict'])
+        logger.info(f"Pre-trained Optimus is successfully loaded: {output_full_dir}")
+        model_vae.to(args.device)
+
+        result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test')
+        result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
+        results.update(result)
+
+        output_eval_file = os.path.join(args.output_dir, "eval_vae_results.txt")
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results *****")
+            for key in sorted(results.keys()):
+                logger.info("%s = %s",
key, str(results[key])) + writer.write("%s = %s\n" % (key, str(results[key]))) + logger.info(f"The testing results are successfully saved: {output_eval_file}") + + ############################## + # Evaluate the reconstruction loss for each checkpoints; + # This is used in studying two different latent vector injection schemes + results = {} + if args.do_eval_rec and args.local_rank in [-1, 0]: + if global_step == 0: + global_step = args.gloabl_step_eval + # eval_steps = range(500, 13500, 500) + # eval_steps = range(1000, 2000, 500) + eval_steps = range(2000, 32000, 2000) + + checkpoints = [] + for e in eval_steps: + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(e)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(e)) + checkpoints.append([output_encoder_dir, output_decoder_dir]) + + + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else "" + + model_encoder = encoder_model_class.from_pretrained(checkpoint[0], latent_size=args.latent_size) + model_encoder.to(args.device) + + model_decoder = decoder_model_class.from_pretrained(checkpoint[1]) + model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) + + result = evaluate_rec(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test') + result = dict((k + '_test_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + result = evaluate_rec(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train') + result = dict((k + '_train_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + # pdb.set_trace() + + output_eval_file = os.path.join(args.output_dir, "eval_rec_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results *****") + for key in sorted(results.keys()): + logger.info("%s = %s", key, str(results[key])) + writer.write("%s = %s\n" % (key, str(results[key]))) + logger.info(f"The testing results are successfully saved: {output_eval_file}") + + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_lm_vae_training_old.py b/Optimus/code/examples/big_ae/run_lm_vae_training_old.py new file mode 100755 index 0000000000000000000000000000000000000000..9069b7d47a592d4aad0f65aa5e82296eb9859ae8 --- /dev/null +++ b/Optimus/code/examples/big_ae/run_lm_vae_training_old.py @@ -0,0 +1,784 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). 
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict + +from azure.cosmosdb.table.tableservice import TableService +from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, TextDataset_Split, TextDataset_2Tokenizers, frange_cycle_linear, frange_cycle_zero_linear) + + +from modules import VAE + + +logging.getLogger("azure").setLevel(logging.WARNING) +logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +ts = TableService(account_name=storage_name, account_key=key) + + +class TextDataset(Dataset): + def __init__(self, tokenizer, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}') + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + self.examples = [] + with open(file_path, encoding="utf-8") as f: + text = f.read() + + + tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) + + while len(tokenized_text) >= block_size: # Truncate in block of block_size + self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size])) + tokenized_text = tokenized_text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
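The cached `TextDataset` above cuts the tokenized corpus into fixed `block_size` chunks and, as its comment notes, drops the trailing remainder rather than padding it. A quick illustration of what that means for the number of training examples (numbers made up):

```python
def count_blocks(num_tokens, block_size=512):
    """How many training examples a token stream yields when cut into fixed-size blocks."""
    kept = num_tokens // block_size
    dropped = num_tokens - kept * block_size
    return kept, dropped

# A 1,300-token document yields 2 examples of 512 tokens; the last 276 tokens are discarded.
print(count_blocks(1300))   # (2, 276)
```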
+ + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return torch.tensor(self.examples[item]) + + + + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + dataset = TextDataset_2Tokenizers(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + else: + dataset = TextDataset_Split(tokenizer, args, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """ + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + indices_random = indices_random + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataset, model_vae, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_vae.encoder, model_vae.decoder, model_vae.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_vae.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_vae.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = 
WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_vae, optimizer = amp.initialize(model_vae, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_vae = torch.nn.DataParallel(model_vae, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_vae = torch.nn.parallel.DistributedDataParallel(model_vae, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + + model_vae.zero_grad() + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=0.0, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + tmp_list = [] + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + tokenized_text0, tokenized_text1, tokenized_text_lengths = batch + # tokenized_text0 = tokenized_text0.to(args.device) + # tokenized_text1 = tokenized_text1.to(args.device) + # prepare input-output data for reconstruction + + # pdb.set_trace() + max_len_values, _ = tokenized_text_lengths.max(0) + tokenized_text0 = tokenized_text0[:,:max_len_values[0]] + tokenized_text1 = tokenized_text1[:,:max_len_values[1]] + + inputs, labels = mask_tokens(tokenized_text0, encoder_tokenizer, args) if args.mlm else (tokenized_text0, tokenized_text1) + labels = tokenized_text1 + + tokenized_text1 = tokenized_text1.to(args.device) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + + model_vae.train() + + beta_t = beta_t_list[step + epoch*len(epoch_iterator)] + model_vae.args.beta = beta_t + + if beta_t == 0.0: + model_vae.args.fb_mode = 0 + else: + model_vae.args.fb_mode = 1 + + if args.use_deterministic_connect: + model_vae.args.fb_mode = 2 + + loss_rec, loss_kl, loss = model_vae(inputs, labels) + # pdb.set_trace() + + # Chunyuan: loss_rec size is [4], while latent_z size is [12] + if args.n_gpu > 1: + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + 
epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss_rec)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; loss: {loss.item():.3f}; ' + f'loss_rec: {loss_rec.item():.3f}; loss_kl: {loss_kl.item():.3f}; ' + f'beta: {model_vae.args.beta:.3f}' + ) + ) + + if global_step % 5 == 0: + row = { + 'PartitionKey': 'MILU_Rule_Rule_Template', + 'RowKey': str(datetime.now()), + 'ExpName' : args.ExpName, + 'iter': str( step + epoch*len(epoch_iterator) ), + 'loss': str( loss.item()), + 'loss_rec': str(loss_rec.item()), + 'loss_kl': str(loss_kl.item()), + 'beta': str(model_vae.args.beta) + } + # pdb.set_trace() + ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_vae.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model_vae.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + 
logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model_vae, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + if subset == 'test': + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + elif subset == 'train': + eval_dataset = load_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=False) + logger.info("***** Running evaluation on {} dataset *****".format(subset)) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.per_gpu_eval_batch_size = 1 + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + + model_vae.eval() + + model_vae = model_vae.module if hasattr(model_vae, 'module') else model_vae # Take care of distributed/parallel training + + mi = calc_mi(model_vae, eval_dataloader, args) + au = calc_au(model_vae, eval_dataloader, delta=0.01, args=args)[0] + ppl, elbo, nll, kl = calc_iwnll(model_vae, eval_dataloader, args, ns=100) + + result = { + "perplexity": ppl, "elbo": elbo, "kl": kl, "nll": nll, "au": au, "mi": mi + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + + row = { + 'PartitionKey': 'MILU_Rule_Rule_Template', + 'RowKey': str(datetime.now()), + 'ExpName' : args.ExpName, + 'test_perplexity': str( ppl ), + 'test_elbo': str( elbo ), + 'test_nll': str(nll), + 'test_au': str(au), + 'test_mi': str(mi) + } + # pdb.set_trace() + ts.insert_entity(table_name, row) + + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + 
parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-cased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="bert-base-cased", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Variational auto-encoder + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # pdb.set_trace() + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + 
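+    # args.block_size is now capped by both the encoder and decoder tokenizers'
+    # single-sentence limits, so one shared block size is used for both sides of the VAE.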
model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + assert tokenizer_decoder.pad_token == '' + + # model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + # on_gpu = next(model_vae.parameters()).is_cuda + + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + global_step= 0 + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataset = load_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataset, model_vae, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + # Save model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + if not os.path.exists(output_encoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_encoder_dir) + if not os.path.exists(output_decoder_dir) and args.local_rank in [-1, 0]: + os.makedirs(output_decoder_dir) + + logger.info("Saving encoder model checkpoint to %s", output_encoder_dir) + logger.info("Saving decoder model checkpoint to %s", output_decoder_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + + model_encoder_to_save = model_vae.module.encoder if hasattr(model_vae, 'module') else model_vae.encoder # Take care of distributed/parallel training + model_decoder_to_save = model_vae.module.decoder if hasattr(model_vae, 'module') else model_vae.decoder # Take care of distributed/parallel training + + # Good practice: save your training arguments together with the trained model + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_encoder_args.bin')) + + + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_decoder_args.bin')) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_encoder_args.bin')) + + + # Load a trained model and vocabulary that you have fine-tuned + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + model_encoder.to(args.device) + + # Load a trained model and vocabulary that you have fine-tuned + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + model_decoder.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + if global_step == 0: + global_step = args.gloabl_step_eval + + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint[0].split('-')[-1] if len(checkpoints) > 1 else "" + + model_encoder = encoder_model_class.from_pretrained(checkpoint[0], latent_size=args.latent_size) + model_encoder.to(args.device) + model_decoder = decoder_model_class.from_pretrained(checkpoint[1], latent_size=args.latent_size) + model_decoder.to(args.device) + + model_vae = VAE(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) + + result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test') + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + # result = evaluate(args, model_vae, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='train') + # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + # results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/big_ae/run_sampling/__pycache__/metrics.cpython-37.pyc b/Optimus/code/examples/big_ae/run_sampling/__pycache__/metrics.cpython-37.pyc new file mode 100755 index 0000000000000000000000000000000000000000..31ea6685fd46883ff68af5b483c8272f99ab906a Binary files /dev/null and b/Optimus/code/examples/big_ae/run_sampling/__pycache__/metrics.cpython-37.pyc differ diff --git 
a/Optimus/code/examples/big_ae/run_spacefusion_pretraining.py b/Optimus/code/examples/big_ae/run_spacefusion_pretraining.py new file mode 100755 index 0000000000000000000000000000000000000000..90eddfc25fe36847e5db63815830e62e9e7079eb --- /dev/null +++ b/Optimus/code/examples/big_ae/run_spacefusion_pretraining.py @@ -0,0 +1,984 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. +""" + +from __future__ import absolute_import, division, print_function + + +import pdb +import argparse +import glob +import logging + +import os +import pickle +import random + +import torch.nn.functional as F +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler, TensorDataset +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange +from collections import defaultdict +from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances +from sklearn import manifold +import matplotlib.pyplot as plt +# from azure.cosmosdb.table.tableservice import TableService +# from azure.cosmosdb.table.models import Entity +from datetime import datetime + + + + +# import sys +# sys.path.append('./') +# cwd = os.getcwd() +# pt_path = os.path.join( cwd[:-4], 'pytorch_transformers') +# sys.path.append(pt_path) +# print(f"Pytorch Transformer {pt_path}") + + + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForLatentConnector, BertTokenizer, + GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + +from utils import (calc_iwnll, calc_mi, calc_au, Dialog_BucketingDataLoader, TextDataset_Split, TextDataset_2Tokenizers, frange_cycle_linear, frange_cycle_zero_linear) + + +from modules import SpaceFusion +from eval_dialog_response import eval_dialog_response +from eval_dialog_multi_response import eval_multi_ref + +# logging.getLogger("azure").setLevel(logging.WARNING) +# logging.getLogger("TableService").setLevel(logging.WARNING) + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2ForLatentConnector, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForLatentConnector, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +# ts = 
TableService(account_name=storage_name, account_key=key) + + +def build_dataload_and_cache_examples(args, tokenizer, evaluate=False): + if isinstance(tokenizer, list): + if not evaluate: + args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + file_path=args.train_data_file + use_shuffle = True + bucket_size = 100 + else: + args.batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + file_path=args.eval_data_file + use_shuffle = False + bucket_size = 1 + + dataloader = Dialog_BucketingDataLoader(file_path, args.batch_size, args.max_seq_length, tokenizer, args, bucket=bucket_size, shuffle=use_shuffle) + else: + pass + return dataloader + + + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + + + +def dist_mat(x): + return euclidean_distances(x, x) + #return cosine_similarity(x, x) + +def euc_dist_mat(x): + n = x.shape[0] + mat = np.zeros((n, n)) + for i in range(n): + for j in range(i + 1, n): + d = np.sqrt(np.sum(np.power(x[i, :] - x[j, :], 2))) + mat[i, j] = d + mat[j, i] = d + return mat + + +def visual2D(args, model_sf, inputs_src, inputs_tgt, n=200, method='MDS', path_prefix='vis_'): + + print('>'*10 + ' calculating z, n=%i'%n) + model_sf.eval() + with torch.no_grad(): + z_AE, z_S2S = model_sf(inputs_src[:n,:], inputs_tgt[:n,:], None, return_vec=True) + z = torch.cat([z_AE, z_S2S], dim=0) + latent = z.cpu().detach().numpy() + labels = ['AE','S2S'] + + colors = { + 'AE': 'r', + 'S2S': 'b', + } + + print('>'*10 + ' calculating dist mat') + # https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html + cmap = 'bwr' #, True:'hot'}#cubehelix'#'gnuplot2'# + dmat = dist_mat(latent) + suffix = '_dist.png' + f, ax = plt.subplots(figsize=(3*len(labels),2*len(labels))) + cax = ax.imshow(dmat, cmap=cmap) + f.colorbar(cax) + + """ + ticks = [] + ticklabels = [] + n_prev = 0 + for i in range(n_labels): + ticks.append(n_prev + n/2) + ticklabels.append(labels[i]+'\n') + ticks.append(n_prev + n) + ticklabels.append('%i'%(n * (i+1))) + n_prev = n_prev + n + ax.set_xticks(ticks) + ax.set_xticklabels(ticklabels) + ax.xaxis.tick_top() + ax.set_yticks(ticks) + ax.set_yticklabels([s.strip('\n') for s in ticklabels]) + """ + path_prefix = os.path.join(args.output_dir, path_prefix) + plt.savefig(path_prefix + suffix) + plt.close() + + print('>'*10 + ' runnning %s'%method) + if method == 'tSNE': + approx = manifold.TSNE(init='pca', verbose=1).fit_transform(latent) + elif method == 'MDS': + approx = manifold.MDS(2, verbose=1, max_iter=500, n_init=1).fit_transform(latent) + elif method == 'isomap': + approx = manifold.Isomap().fit_transform(latent) + else: + raise ValueError + + f, ax = plt.subplots() + for k in labels: + ax.plot(np.nan, np.nan, colors[k] + '.', label=k) + + i0 = 0 + for k in labels: + i1 = i0 + n + ax.plot(approx[i0:i1, 0], approx[i0:i1, 1], colors[k]+'.', alpha=0.5) + i0 = i1 + + plt.legend(loc='best') + plt.savefig(path_prefix+'_%s.png'%method) + + +def train(args, train_dataloader, model_sf, encoder_tokenizer, decoder_tokenizer, table_name): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + # train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + # train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: 
+ t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + + + # model_encoder, model_decoder, model_connector = model_sf.encoder, model_sf.decoder, model_sf.linear + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model_sf.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model_sf.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model_sf, optimizer = amp.initialize(model_sf, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model_sf = torch.nn.DataParallel(model_sf, device_ids=range(args.n_gpu)).to(args.device) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model_sf = torch.nn.parallel.DistributedDataParallel(model_sf, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", train_dataloader.num_examples) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + + + model_sf.zero_grad() + + # model_sf = model_sf.module if hasattr(model_sf, 'module') else model_sf # Take care of distributed/parallel training + + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + + n_iter = int(args.num_train_epochs) * len(train_dataloader) + beta_t_list = frange_cycle_zero_linear(n_iter, start=args.beta, stop=args.beta, n_cycle=1, ratio_increase=args.ratio_increase, ratio_zero=args.ratio_zero) + + tmp_list = [] + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + + # if step > 5: + # break + + input_ids_bert_ctx, input_ids_bert, input_ids_gpt, token_lengths = batch + + # if token_lengths[0,0]>512: + # input_ids_bert_ctx = input_ids_bert_ctx[0,:512].unsqueeze(0) + + # if token_lengths[0,1]>512: + # input_ids_bert_ctx = input_ids_bert_ctx[0,:512].unsqueeze(0) + + + #logger.info(f'Conxtext in Bert, Length {token_lengths[0]} ; Tokens: {input_ids_bert_ctx}') + #logger.info(f'Response in Bert, Length {token_lengths[1]} ; Tokens: {input_ids_bert}') + #logger.info(f'Response in GPT2, Length {token_lengths[2]} ; Tokens: {input_ids_gpt}') + + #pdb.set_trace() + model_sf.train() + beta_t = beta_t_list[step + epoch*len(epoch_iterator)] + model_sf.module.args.beta = beta_t + + + """ + xiag: not sure about fb_mode yet + + if beta_t == 0.0: + model_sf.args.fb_mode = 0 + else: + model_sf.args.fb_mode = 1 + + if args.use_deterministic_connect: + model_sf.args.fb_mode = 2 + """ + + input_ids_bert_ctx = input_ids_bert_ctx.to(args.device) + input_ids_bert = input_ids_bert.to(args.device) + input_ids_gpt = input_ids_gpt.to(args.device) + + loss_rec, loss_kl, loss = model_sf(input_ids_bert_ctx, input_ids_bert, input_ids_gpt) + + + # the following is copied from run_lm_vae_pretraining.py + + # Chunyuan: loss_rec size is [4], while latent_z size is [12] + if args.n_gpu > 1: + loss_rec = loss_rec.mean() # mean() to average on multi-gpu parallel training + loss_kl = loss_kl.mean() + loss = loss.mean() + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss_rec)) + + epoch_iterator.set_description( + ( + f'iter: {step + epoch*len(epoch_iterator) }; loss: {loss.mean().item():.3f}; ' + f'loss_rec: {loss_rec.mean().item():.3f}; loss_kl: {loss_kl.mean().item():.3f}; ' + f'beta: {model_sf.module.args.beta:.3f}' + ) + ) + + if global_step % 5 == 0: + row = { + 'PartitionKey': 'MILU_Rule_Rule_Template', + 'RowKey': str(datetime.now()), + 'ExpName' : args.ExpName, + 'iter': str( step + epoch*len(epoch_iterator) ), + 'loss': str( loss.mean().item()), + 'loss_rec': str(loss_rec.mean().item()), + 'loss_kl': str(loss_kl.mean().item()), + 'beta': str(model_sf.module.args.beta) + } + # pdb.set_trace() + #ts.insert_entity(table_name, row) + + # pdb.set_trace() + + if args.gradient_accumulation_steps > 1: + loss = loss / 
args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss = loss.mean() + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model_sf.parameters(), args.max_grad_norm) + + optimizer.step() + + scheduler.step() # Update learning rate schedule + + model_sf.zero_grad() + + global_step += 1 + + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model_sf, encoder_tokenizer, decoder_tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + + # Save encoder model checkpoint + output_encoder_dir = os.path.join(args.output_dir, 'checkpoint-encoder-{}'.format(global_step)) + + if not os.path.exists(output_encoder_dir): + os.makedirs(output_encoder_dir) + + model_encoder_to_save = model_sf.module.encoder if hasattr(model_sf, 'module') else model_sf.encoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + save_solid = True + except: + pass + else: + model_encoder_to_save.save_pretrained(output_encoder_dir) + torch.save(args, os.path.join(output_encoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_encoder_dir) + + # Save decoder model checkpoint + output_decoder_dir = os.path.join(args.output_dir, 'checkpoint-decoder-{}'.format(global_step)) + + if not os.path.exists(output_decoder_dir): + os.makedirs(output_decoder_dir) + + model_decoder_to_save = model_sf.module.decoder if hasattr(model_sf, 'module') else model_sf.decoder # Take care of distributed/parallel training + if args.use_philly: + save_solid = False + while not save_solid: + try: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + save_solid = True + except: + pass + else: + model_decoder_to_save.save_pretrained(output_decoder_dir) + torch.save(args, os.path.join(output_decoder_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_decoder_dir) + + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + + + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step#, tr_loss / global_step + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape 
(vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + + +def top_k_top_p_filtering_mb(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + # scatter sorted tensors to original indexing + indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove) + logits[indices_to_remove] = filter_value + return logits + +def sample_sequence_conditional(model, length, context, past=None, num_samples=1, temperature=1, top_k=0, top_p=0.0, device='cpu', decoder_tokenizer=None): + + generated = context + with torch.no_grad(): + while True: + # for _ in trange(length): + inputs = {'input_ids': generated, 'past': past} + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][:, -1, :] / temperature + filtered_logits = top_k_top_p_filtering_mb(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token), dim=1) + + # pdb.set_trace() + if next_token.unsqueeze(0)[0,0].item() == decoder_tokenizer.encode('')[0] or generated.shape[1] > length : + break + + # gpt_eos_id = decoder_tokenizer.encode('')[0] + # idx = (generated == gpt_eos_id).nonzero().squeeze() + + # pdb.set_trace() + return generated + + +def evaluate(args, model_sf, encoder_tokenizer, decoder_tokenizer, table_name, prefix="", subset="test"): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + logger.info("***** Running evaluation on {} dataset *****".format(subset)) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + # args.per_gpu_eval_batch_size = 1 + args.n_gpu = 1 + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + eval_dataloader = build_dataload_and_cache_examples(args, [encoder_tokenizer, decoder_tokenizer], evaluate=True) + + # Eval! 
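+    # Generation-based evaluation: for each dialogue context, the latent code is obtained
+    # with model_sf.sent2latent(), args.sents_per_cxt responses are sampled from the GPT-2
+    # decoder using top-k / nucleus sampling, and tab-separated (source, reference,
+    # hypothesis) lines are written to eval_text_generation_results.txt.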
+ logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataloader)) + logger.info(" Batch size = %d", args.eval_batch_size) + + model_sf.eval() + + count = 0 + result = [] + + epoch_iterator = tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + input_ids_bert_ctx, input_ids_bert, input_ids_gpt, token_lengths = batch + + input_ids_bert_ctx = input_ids_bert_ctx.to(args.device) + input_ids_bert = input_ids_bert.to(args.device) + input_ids_gpt = input_ids_gpt.to(args.device) + + if len(input_ids_bert_ctx[0,:])>512: + input_ids_bert_ctx = input_ids_bert_ctx[0,-512:].unsqueeze(0) + + # else: + # continue + + # pdb.set_trace() + + # if step == 0: + # input_ids_bert_ctx_previous = input_ids_bert_ctx + # else: + # # pdb.set_trace() + # if (input_ids_bert_ctx_previous.shape == input_ids_bert_ctx.shape) and torch.eq(input_ids_bert_ctx_previous, input_ids_bert_ctx)[0].type(torch.float).mean().item() == 1.0: + # continue + # else: + # input_ids_bert_ctx_previous = input_ids_bert_ctx + # print(step) + + + context_tokens = decoder_tokenizer.encode('') + context_tokens = torch.tensor(context_tokens, dtype=torch.long, device=args.device) + context_tokens = context_tokens.unsqueeze(0).repeat(token_lengths.shape[0], 1) + + with torch.no_grad(): + + text_src = encoder_tokenizer.decode(input_ids_bert_ctx[0,:].tolist(), clean_up_tokenization_spaces=False) + text_src = "".join(text_src) + + text_ref = encoder_tokenizer.decode(input_ids_bert[0,:].tolist(), clean_up_tokenization_spaces=False) + text_ref = "".join(text_ref) + + for i in range(args.sents_per_cxt): + latent_z = model_sf.sent2latent(input_ids_bert_ctx) + + out = sample_sequence_conditional( + model=model_sf.decoder, + context=context_tokens, + past=latent_z, + length=256, # Chunyuan: Fix length; or use to complete a sentence + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + decoder_tokenizer = decoder_tokenizer + ) + text_hpy = decoder_tokenizer.decode(out[0,:].tolist(), clean_up_tokenization_spaces=False) + + text_hpy = text_hpy.split()[1:-1] + text_hpy = ' '.join(text_hpy) + '\n' + + textline = "\t".join([text_src, text_ref, text_hpy]) + # pdb.set_trace() + result.append(textline) + + + epoch_iterator.set_description( + ( + f'step: {step}' + ) + ) + + count += 1 + if args.total_sents>0 and count>args.total_sents: + break + + + output_eval_file = os.path.join(eval_output_dir, "eval_text_generation_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for res in result: + # logger.info("%s \n" % res) + writer.write("%s \n" % res) + + return result + + + + + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument("--dataset", default=None, type=str, help="The dataset.") + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to run text generation.") + 
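+    # Unlike --eval_data_file above (dialogue contexts used as generation input),
+    # --eval_generated_text_file below refers to text that has already been generated
+    # and is only evaluated.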
parser.add_argument("--eval_generated_text_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a generated text file).") + parser.add_argument("--ExpName", default="", type=str, + help="The experiment name used in Azure Table.") + + ## Encoder options + parser.add_argument("--encoder_model_type", default="bert", type=str, + help="The encoder model architecture to be fine-tuned.") + parser.add_argument("--encoder_model_name_or_path", default="bert-base-uncased", type=str, + help="The encoder model checkpoint for weights initialization.") + parser.add_argument("--encoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--encoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Decoder options + parser.add_argument("--decoder_model_type", default="gpt2", type=str, + help="The decoder model architecture to be fine-tuned.") + parser.add_argument("--decoder_model_name_or_path", default="gpt2", type=str, + help="The decoder model checkpoint for weights initialization.") + parser.add_argument("--decoder_config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--decoder_tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + + ## Space Fusion + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument("--use_deterministic_connect", action='store_true', + help="Use deterministic inference to generate latent codes, i.e., standard auto-encoders.") + parser.add_argument("--use_pretrained_model", action='store_true', + help="Use pre-trained auto-encoder models as the initialization") + parser.add_argument("--use_pretrained_vae", action='store_true', + help="Use use_pretrained_vae as initialization, where beta value is specified in the folder") + parser.add_argument("--num_s2s_bert_layer", default=1, type=int, help="Number of BERT layer used for S2S loass in space fusion.") + parser.add_argument("--num_frozen_bert_layer", default=11, type=int, help="Number of BERT layer used for S2S loass in space fusion") + + parser.add_argument('--freeze_bert', action='store_true') + parser.add_argument('--n_pnt', type=int, default=200) + parser.add_argument('--path_ids', type=str, default='dailydialog_data_1000.pt') + + + ## Objective functions + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + parser.add_argument("--beta", type=float, default=1.0, + help="The weighting hyper-parameter of the KL term in VAE") + + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--max_seq_length", default=512, type=int, + help="Optional input sequence length before tokenization. The sequence will be dropped if it is longer the max_seq_length") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." 
+ "The training dataset will be truncated in block of this size for training." + "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_generation", action='store_true', + help="Whether to run text generation on the dev set.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--do_vis", action='store_true', + help="Whether to run visualization on the latent space.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + + # Training Schedules + parser.add_argument("--ratio_increase", default=0.25, type=float, + help="Learning schedule, the percentage for the annealing stage.") + parser.add_argument("--ratio_zero", default=0.25, type=float, + help="Learning schedule, the percentage for the pure auto-encoding stage.") + parser.add_argument("--fb_mode", default=0, type=int, + help="free bit training mode.") + parser.add_argument("--dim_target_kl", default=3.0, type=float, + help="dim_target_kl free bit training mode.") + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + ## IO: Logging and Saving + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + # Text Generation + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--total_sents", default=10, type=int, help="Total sentences to test recontruction.") + parser.add_argument("--sents_per_cxt", default=10, type=int, help="Number of responses to generate for a given context.") + + + # Precision & Distributed Training + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.decoder_model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + args.ExpName = 'Vae_' + args.dataset + '_Nz_' + str(args.latent_size) + '_Beta_' + str(args.beta) + '_Dkl_' + str(args.dim_target_kl) + '_Ra_' + str(args.ratio_increase) + '_R0_' + str(args.ratio_zero) + table_name = 'Vae' + args.dataset + 'Nz' + str(args.latent_size) + try: + ts.create_table(table_name) + except: + pass + + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + if args.do_train or args.do_generation or args.do_vis: + if args.use_pretrained_model: + + args.encoder_model_type = args.encoder_model_type.lower() + args.decoder_model_type = args.decoder_model_type.lower() + + global_step = args.gloabl_step_eval + if args.use_pretrained_vae: + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}-1.0'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}-1.0'.format(global_step)) + else: + output_encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + output_decoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-decoder-{}'.format(global_step)) + + checkpoints = [ [output_encoder_dir, output_decoder_dir] ] + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + # Load a trained Encoder model and vocabulary + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + model_encoder = encoder_model_class.from_pretrained(output_encoder_dir, latent_size=args.latent_size) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + + model_encoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + + # Load a trained Decoder model and vocabulary + decoder_config_class, decoder_model_class, 
decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + model_decoder = decoder_model_class.from_pretrained(output_decoder_dir, latent_size=args.latent_size) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + model_decoder.to(args.device) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + + else: + ## Encoder + encoder_config_class, encoder_model_class, encoder_tokenizer_class = MODEL_CLASSES[args.encoder_model_type] + encoder_config = encoder_config_class.from_pretrained(args.encoder_config_name if args.encoder_config_name else args.encoder_model_name_or_path) + tokenizer_encoder = encoder_tokenizer_class.from_pretrained(args.encoder_tokenizer_name if args.encoder_tokenizer_name else args.encoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_encoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_encoder.max_len_single_sentence) + model_encoder = encoder_model_class.from_pretrained(args.encoder_model_name_or_path, from_tf=bool('.ckpt' in args.encoder_model_name_or_path), config=encoder_config, latent_size=args.latent_size) + # model_encoder.to(args.device) + + ## Decoder + decoder_config_class, decoder_model_class, decoder_tokenizer_class = MODEL_CLASSES[args.decoder_model_type] + decoder_config = decoder_config_class.from_pretrained(args.decoder_config_name if args.decoder_config_name else args.decoder_model_name_or_path) + tokenizer_decoder = decoder_tokenizer_class.from_pretrained(args.decoder_tokenizer_name if args.decoder_tokenizer_name else args.decoder_model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer_decoder.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer_decoder.max_len_single_sentence) + setattr(decoder_config, "latent_size", args.latent_size) + model_decoder = decoder_model_class.from_pretrained(args.decoder_model_name_or_path, from_tf=bool('.ckpt' in args.decoder_model_name_or_path), config=decoder_config, latent_size=args.latent_size, latent_as_gpt_emb=False) + + # Chunyuan: Add Padding token to GPT2 + special_tokens_dict = {'pad_token': '', 'bos_token': '', 'eos_token': ''} + num_added_toks = tokenizer_decoder.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens to GPT2') + model_decoder.resize_token_embeddings(len(tokenizer_decoder)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
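        # The newly added pad/bos/eos ids have no pre-trained GPT-2 embeddings: the rows created by
        # resize_token_embeddings are freshly initialized and are learned during fine-tuning. The
        # assert below simply confirms that the pad token string now registered with the tokenizer
        # matches the one declared in special_tokens_dict above.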
+ assert tokenizer_decoder.pad_token == '' + + model_sf = SpaceFusion(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + + # Training + if args.do_train: + global_step= 0 + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataloader = build_dataload_and_cache_examples(args, [tokenizer_encoder, tokenizer_decoder], evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step = train(args, train_dataloader, model_sf, tokenizer_encoder, tokenizer_decoder, table_name) + logger.info(" global_step = %s", global_step) + + # Text Generation based on a trained model + if args.do_generation and args.local_rank in [-1, 0]: + results = {} + model_sf = SpaceFusion(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + result = evaluate(args, model_sf, tokenizer_encoder, tokenizer_decoder, table_name, prefix=global_step, subset='test') + + # Evaluation + if args.do_eval and args.local_rank in [-1, 0]: + + if args.dataset == "dailydialog": + results = eval_dialog_response(args.eval_generated_text_file) + else: + results = eval_multi_ref(args.eval_generated_text_file, args.eval_data_file) + + + output_eval_file = os.path.join(args.output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results *****") + for key in sorted(results.keys()): + logger.info("%s = %s", key, str(results[key])) + writer.write("%s = %s\n" % (key, str(results[key]))) + + # Visualization of the latent space + if args.do_vis and args.local_rank in [-1, 0]: + + print('>'*10 + ' loading ids') + ids = torch.load(args.path_ids) + inputs_src = ids['input_ids_bert_ctx'] + inputs_tgt = ids['input_ids_bert'] + model_sf = SpaceFusion(model_encoder, model_decoder, tokenizer_encoder, tokenizer_decoder, args).to(args.device) # + visual2D(args, model_sf, inputs_src, inputs_tgt, n=args.n_pnt) + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/Optimus/code/examples/big_ae/utils.py b/Optimus/code/examples/big_ae/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..4230a9b07f7a31ee208e47963900fdc8f839775b --- /dev/null +++ b/Optimus/code/examples/big_ae/utils.py @@ -0,0 +1,1459 @@ +import numpy as np +import os, sys +import torch +from torch import nn, optim +import subprocess +from tqdm import tqdm, trange +from torch.utils.data import DataLoader, Dataset, Sampler, SequentialSampler, RandomSampler +from torch.nn.utils.rnn import pad_sequence + +import json +import pdb + +import torch.nn.init as init + +import glob +import logging +import pickle +import random +from torch.utils.data.distributed import DistributedSampler + +logger = logging.getLogger(__name__) + + + +class Meter(object): + '''Meters provide a way to keep track of important statistics in an online manner. + This class is abstract, but provides a standard interface for all meters to follow. + ''' + + def reset(self): + '''Resets the meter to default settings.''' + pass + + def add(self, value): + '''Log a new value to the meter + Args: + value: Next restult to include. 
+ ''' + pass + + def value(self): + '''Get the value of the meter in the current state.''' + pass + +class AverageValueMeter(Meter): + def __init__(self): + super(AverageValueMeter, self).__init__() + self.reset() + self.val = 0 + + def add(self, value, n=1): + self.val = value + self.sum += value + self.var += value * value + self.n += n + + if self.n == 0: + self.mean, self.std = np.nan, np.nan + elif self.n == 1: + self.mean = 0.0 + self.sum # This is to force a copy in torch/numpy + self.std = np.inf + self.mean_old = self.mean + self.m_s = 0.0 + else: + self.mean = self.mean_old + (value - n * self.mean_old) / float(self.n) + self.m_s += (value - self.mean_old) * (value - self.mean) + self.mean_old = self.mean + self.std = np.sqrt(self.m_s / (self.n - 1.0)) + + def value(self): + return self.mean, self.std + + def reset(self): + self.n = 0 + self.sum = 0.0 + self.var = 0.0 + self.val = 0.0 + self.mean = np.nan + self.mean_old = 0.0 + self.m_s = 0.0 + self.std = np.nan + + + +class BucketSampler(Sampler): + def __init__(self, lens, bucket_size, batch_size, droplast=False, shuffle=True): + self._lens = lens + self._batch_size = batch_size + self._bucket_size = bucket_size + self._droplast = droplast + self._shuf = shuffle + + def __iter__(self): + ids = list(range(len(self._lens))) + if self._shuf: + random.shuffle(ids) + buckets = [sorted(ids[i:i+self._bucket_size], + key=lambda i: self._lens[i], reverse=True) + for i in range(0, len(ids), self._bucket_size)] + # buckets = [ids[i:i+self._bucket_size] for i in range(0, len(ids), self._bucket_size)] + batches = [bucket[i:i+self._batch_size] + for bucket in buckets + for i in range(0, len(bucket), self._batch_size)] + if self._droplast: + batches = [batch for batch in batches + if len(batch) == self._batch_size] + if self._shuf: + random.shuffle(batches) + return iter(batches) + + def __len__(self): + bucket_sizes = ([self._bucket_size] + * (len(self._lens) // self._bucket_size) + + [len(self._lens) % self._bucket_size]) + if self._droplast: + return sum(s//self._batch_size for s in bucket_sizes) + else: + return sum(math.ceil(s/self._batch_size) for s in bucket_sizes) + + +class FeatureDataset(Dataset): + def __init__(self, features, max_len=None): + self.features = features + self.max_len = max_len # this max_len do truncate + + def __getitem__(self, i): + feat_dict = self.features[i] + feat = InputFeatures(**feat_dict) + return feat + + def __len__(self): + return len(self.features) + + @staticmethod + def collate(features): + input_ids_bert = pad_sequence([torch.tensor(f.input_ids_bert, dtype=torch.long) for f in features], batch_first=True, padding_value=0) + input_ids_gpt = pad_sequence([torch.tensor(f.input_ids_gpt, dtype=torch.long) for f in features], batch_first=True, padding_value=0) + lm_labels = pad_sequence([torch.tensor(f.input_ids_gpt, dtype=torch.long) for f in features], batch_first=True, padding_value=-1) + return (input_ids_bert, input_ids_gpt, lm_labels) + +class BucketingDataLoader(object): + def __init__(self, file_path, batch_size, max_seq_length, tokenizer, args, bucket=100, shuffle=True): + + self.dataset = TokenDataset(tokenizer, args, file_path, block_size=args.block_size) + self.batch_size = batch_size + self.max_len = max_seq_length + self.bucket_size = bucket * batch_size + self.shuffle = shuffle + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + def 
__iter__(self): + sampler = BucketSampler(self.example_lengths, self.bucket_size, self.batch_size, droplast=True, shuffle=self.shuffle) + loader = DataLoader(self.dataset, batch_sampler=sampler, num_workers=0, collate_fn=TokenDataset.collate) + yield from loader + + def __len__(self): + return self.num_batches + + def __del__(self): + pass + + +class Dialog_BucketingDataLoader(object): + def __init__(self, file_path, batch_size, max_seq_length, tokenizer, args, bucket=100, shuffle=True): + + self.dataset = Dialog_TokenDataset(tokenizer, args, file_path, block_size=args.block_size) + self.batch_size = batch_size + self.max_len = max_seq_length + self.bucket_size = bucket * batch_size + self.shuffle = shuffle + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + def __iter__(self): + sampler = BucketSampler(self.example_lengths, self.bucket_size, self.batch_size, droplast=True, shuffle=self.shuffle) + loader = DataLoader(self.dataset, batch_sampler=sampler, num_workers=0, collate_fn=Dialog_TokenDataset.collate) + yield from loader + + def __len__(self): + return self.num_batches + + def __del__(self): + pass + + + +class MultipleFiles_DataLoader(object): + def __init__(self, file_path, batch_size, max_seq_length, tokenizer, args, bucket=100, shuffle=True, use_tensor=True): + + + self.batch_size = batch_size + self.max_len = max_seq_length + self.bucket_size = bucket * batch_size + self.shuffle = shuffle + self.file_path = file_path + self.tokenizer = tokenizer + self.args = args + self.use_tensor=use_tensor + + # prepare for the first file + self.file_idx = 0 + self.cached_features_file = os.path.join(self.file_path, args.dataset.lower()+f'.segmented.nltk.split.seq64.{self.file_idx}.json' ) + self.dataset = PreparedTokenDataset(tokenizer, self.args, self.cached_features_file, block_size=self.args.block_size) + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + + def __iter__(self): + + sampler = BucketSampler(self.example_lengths, self.bucket_size, self.batch_size, droplast=True, shuffle=self.shuffle) + loader = DataLoader(self.dataset, batch_sampler=sampler, num_workers=0, collate_fn=PreparedTokenDataset.collate if self.use_tensor else PreparedTokenDataset.get_examples ) + yield from loader + + # update file name for next file + self.file_idx += 1 + self.cached_features_file = os.path.join(self.file_path, self.args.dataset.lower()+f'.segmented.nltk.split.seq64.{self.file_idx}.json' ) + self.dataset = PreparedTokenDataset(self.tokenizer, self.args, self.cached_features_file, block_size=self.args.block_size) + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//self.batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + + def __len__(self): + return self.num_batches + + def __del__(self): + pass + + def reset(self): + self.file_idx = 0 + + +# When the dataset is too big, we can divide it into multiple small files. +# This class is used load multiple files. 
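# A small illustrative sketch (not used anywhere in the training scripts) of how the
# BucketSampler defined above batches indices: examples are grouped into buckets, sorted by
# length inside each bucket, and cut into batches, so every batch contains sequences of
# similar length and needs little padding. The toy lengths below are made up for demonstration.
# (Note: BucketSampler.__len__ relies on math.ceil when droplast is False, which appears to
# need an `import math` at the top of this file.)
def _bucket_sampler_demo():
    toy_lengths = [5, 32, 7, 31, 6, 30, 8, 29]   # pretend token lengths of 8 examples
    sampler = BucketSampler(toy_lengths, bucket_size=4, batch_size=2,
                            droplast=True, shuffle=False)
    for batch in sampler:
        # yields [1, 3] -> lengths [32, 31], [2, 0] -> lengths [7, 5], then [5, 7], [6, 4]
        print(batch, [toy_lengths[i] for i in batch])

# The multi-file loader below applies the same idea to sharded corpora: each shard is expected
# to follow the naming pattern "<dataset>.segmented.nltk.split.seq64.<idx>.json" built in
# MultipleFiles_DataLoader above, and a DistributedSampler is used so that every GPU reads a
# distinct slice of the current shard.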
+class BucketingMultipleFiles_DataLoader(object): + def __init__(self, file_path, batch_size, max_seq_length, tokenizer, args, bucket=100, shuffle=True): + + self.batch_size = batch_size + self.max_len = max_seq_length + self.bucket_size = bucket * batch_size + self.shuffle = shuffle + self.file_path = file_path + self.tokenizer = tokenizer + self.args = args + + # prepare for the first file + self.file_idx = 0 + self.cached_features_file = os.path.join(self.file_path, args.dataset.lower()+f'.segmented.nltk.split.seq64.{self.file_idx}.json' ) + self.dataset = PreparedTokenDataset(tokenizer, self.args, self.cached_features_file, block_size=self.args.block_size) + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + + def __iter__(self): + + # sampler = BucketSampler(self.example_lengths, self.bucket_size, self.batch_size, droplast=True, shuffle=self.shuffle) + # loader = DataLoader(self.dataset, batch_sampler=sampler, num_workers=0, collate_fn=PreparedTokenDataset.collate) + + # distributed + sampler = DistributedSampler(self.dataset) + loader = DataLoader(self.dataset, sampler=sampler, batch_size=self.batch_size, pin_memory=True, num_workers=0, collate_fn=PreparedTokenDataset.collate) + yield from loader + + # update file name for next file + self.file_idx += 1 + self.cached_features_file = os.path.join(self.file_path, self.args.dataset.lower()+f'.segmented.nltk.split.seq64.{self.file_idx}.json' ) + self.dataset = PreparedTokenDataset(self.tokenizer, self.args, self.cached_features_file, block_size=self.args.block_size) + self.num_examples = len(self.dataset.examples) + self.num_batches = self.num_examples//self.batch_size + self.example_lengths = [example['bert_token_length'] for example in self.dataset.examples] + + + def __len__(self): + return self.num_batches + + def __del__(self): + pass + + def reset(self): + self.file_idx = 0 + + +class PreparedTokenDataset(Dataset): + def __init__(self, tokenizers, args, cached_features_file='train', text_split_mode='natural', block_size=512): + logger.info(cached_features_file) + assert os.path.isfile(cached_features_file) + + self.examples = [] + self.tokenizers = tokenizers + + # Bert tokenizer special tokens + self.bert_pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0] + + # GPT-2 tokenizer special tokens + self.gpt2_pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + self.gpt2_bos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].bos_token])[0] + self.gpt2_eos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].eos_token])[0] + + global bert_pad_token + global gpt2_pad_token + bert_pad_token = self.bert_pad_token + gpt2_pad_token = self.gpt2_pad_token + + if args.dataset == 'Yahoo' or args.dataset == 'Penn' or args.dataset == 'Snli' or args.dataset == 'Debug' or args.dataset == 'wikipedia': + label_on = False + elif args.dataset == 'Yelp': + label_on = True + + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'r') as handle: + self.examples = json.load(handle) + + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return self.examples[item] + + + @staticmethod + def get_examples(examples): + token_lengths = torch.tensor( [[f['bert_token_length'], f['gpt2_token_length']] for f in examples] , dtype=torch.long) + return examples, token_lengths + + + 
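    # Each cached example loaded above is expected to be a plain dict of the form
    #   {"bert_token": [...BERT input ids...],  "bert_token_length": int,
    #    "gpt2_token": [...GPT-2 input ids...], "gpt2_token_length": int}
    # get_examples() returns the raw dicts plus a (batch, 2) tensor of lengths, while collate()
    # below pads the two id lists into dense LongTensors using the module-level
    # bert_pad_token / gpt2_pad_token values set in __init__.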
@staticmethod + def collate(examples): + # Convert to Tensors and build dataset + input_ids_bert = pad_sequence([torch.tensor(f['bert_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=bert_pad_token) + input_ids_gpt = pad_sequence([torch.tensor(f['gpt2_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=gpt2_pad_token) + token_lengths = torch.tensor( [[f['bert_token_length'], f['gpt2_token_length']] for f in examples] , dtype=torch.long) + + return (input_ids_bert, input_ids_gpt, token_lengths) + + +class TokenDataset(Dataset): + def __init__(self, tokenizers, args, file_path='train', text_split_mode='natural', block_size=512): + + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename[:-4]}.json') + + self.examples = [] + self.tokenizers = tokenizers + + # Bert tokenizer special tokens + self.bert_pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0] + + # GPT-2 tokenizer special tokens + self.gpt2_pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + self.gpt2_bos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].bos_token])[0] + self.gpt2_eos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].eos_token])[0] + + global bert_pad_token + global gpt2_pad_token + bert_pad_token = self.bert_pad_token + gpt2_pad_token = self.gpt2_pad_token + + if args.dataset == 'Yelp': + label_on = True + else: + label_on = False + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'r') as handle: + self.examples = json.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + dropped, count = self._read_corpus_natural_split(fname=file_path, label=label_on, max_length=block_size, block_size=block_size, args=args) + + logger.info("The number of dropped sentences is %d", dropped) + logger.info("The number of processed sentences is %d", count) + + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. 
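            # The block below caches self.examples as cached_lm_gpt_bert_<block_size>_<name>.json
            # next to the raw corpus file, so later runs can skip tokenization. When --use_philly
            # is set, the write is wrapped in a retry loop intended to tolerate transient
            # cluster-filesystem errors; the loop is meant to flip save_solid to True once
            # json.dump succeeds so that it terminates.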
+ + logger.info("Saving features into cached file %s", cached_features_file) + if args.use_philly: + save_solid = False + while not save_solid: + try: + with open(cached_features_file, 'w') as handle: + json.dump(self.examples, handle) + except: + pass + else: + with open(cached_features_file, 'w') as handle: + json.dump(self.examples, handle) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return self.examples[item] + + @staticmethod + def collate(examples): + # Convert to Tensors and build dataset + input_ids_bert = pad_sequence([torch.tensor(f['bert_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=bert_pad_token) + input_ids_gpt = pad_sequence([torch.tensor(f['gpt2_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=gpt2_pad_token) + token_lengths = torch.tensor( [[f['bert_token_length'], f['gpt2_token_length']] for f in examples] , dtype=torch.long) + + return (input_ids_bert, input_ids_gpt, token_lengths) + + def _read_corpus_natural_split(self, fname, label, max_length, block_size, args): + data = [] + labels = [] if label else None + dropped = 0 + count = 0 + + with open(fname) as fin: + for line in fin: + if label: + split_line = line.split('\t') + lb = split_line[0] + split_line_text = split_line[1] + else: + split_line_text = line + split_line_text = split_line_text.strip() + + if len(split_line_text.split()) < 1: + dropped += 1 + continue + + if max_length: + if len(split_line_text.split()) > max_length: + dropped += 1 + continue + + if label: + labels.append(lb) + + tokenized_text0 = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(split_line_text)) + tokenized_text0 = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + + tokenized_text1 = self.tokenizers[1].convert_tokens_to_ids(self.tokenizers[1].tokenize(split_line_text)) + tokenized_text1 = self.tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1 = [self.gpt2_bos_token] + tokenized_text1 + [self.gpt2_eos_token] + tokenized_text1_length = len(tokenized_text1) + + example = { + 'bert_token': tokenized_text0, + 'bert_token_length':tokenized_text0_length, + 'gpt2_token':tokenized_text1, + 'gpt2_token_length': tokenized_text1_length + } + self.examples.append(example) + count +=1 + + return dropped, count + + + + + +class Dialog_TokenDataset(Dataset): + def __init__(self, tokenizers, args, file_path='train', text_split_mode='natural', block_size=512): + + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename[:-4]}.json') + + self.examples = [] + self.tokenizers = tokenizers + + # Bert tokenizer special tokens + self.bert_pad_token=tokenizers[0].convert_tokens_to_ids([tokenizers[0].pad_token])[0] + + # GPT-2 tokenizer special tokens + self.gpt2_pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + self.gpt2_bos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].bos_token])[0] + self.gpt2_eos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].eos_token])[0] + + global bert_pad_token + global gpt2_pad_token + bert_pad_token = self.bert_pad_token + gpt2_pad_token = self.gpt2_pad_token + + if args.dataset == 'Yelp': + label_on = True + else: + label_on = False + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", 
cached_features_file) + with open(cached_features_file, 'r') as handle: + self.examples = json.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + dropped, count = self._read_dialog_corpus_natural_split(fname=file_path, label=label_on, max_length=block_size, block_size=block_size, args=args) + + logger.info("The number of dropped sentences is %d", dropped) + logger.info("The number of processed sentences is %d", count) + + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. + + logger.info("Saving features into cached file %s", cached_features_file) + if args.use_philly: + save_solid = False + while not save_solid: + try: + with open(cached_features_file, 'w') as handle: + json.dump(self.examples, handle) + except: + pass + else: + with open(cached_features_file, 'w') as handle: + json.dump(self.examples, handle) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return self.examples[item] + + @staticmethod + def collate(examples): + # Convert to Tensors and build dataset + input_ids_bert_ctx = pad_sequence([torch.tensor(f['bert_token_ctx'], dtype=torch.long) for f in examples], batch_first=True, padding_value=bert_pad_token) + input_ids_bert = pad_sequence([torch.tensor(f['bert_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=bert_pad_token) + input_ids_gpt = pad_sequence([torch.tensor(f['gpt2_token'], dtype=torch.long) for f in examples], batch_first=True, padding_value=gpt2_pad_token) + token_lengths = torch.tensor( [[f['bert_token_ctx_length'], f['bert_token_length'], f['gpt2_token_length']] for f in examples] , dtype=torch.long) + + return (input_ids_bert_ctx, input_ids_bert, input_ids_gpt, token_lengths) + + def _read_dialog_corpus_natural_split(self, fname, label, max_length, block_size, args): + data = [] + labels = [] if label else None + dropped = 0 + count = 0 + + with open(fname) as fin: + for line in fin: + + split_line_text = line + split_line_text = split_line_text.strip() + + if len(split_line_text.split()) < 1: + dropped += 1 + continue + + # if max_length: + # if len(split_line_text.split()) > max_length: + # dropped += 1 + # continue + + context_text, response_text = split_line_text.split('\t') + + tokenized_text_ctx = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(context_text)) + tokenized_text_ctx = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text_ctx) + + if len(tokenized_text_ctx)>512: + tokenized_text_ctx = tokenized_text_ctx[-512:] + # pdb.set_trace() + tokenized_text_ctx_length = len(tokenized_text_ctx) + + tokenized_text0 = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(response_text)) + tokenized_text0 = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + if len(tokenized_text0)>512: + tokenized_text0 = tokenized_text0[-512:] + + tokenized_text0_length = len(tokenized_text0) + + tokenized_text1 = self.tokenizers[1].convert_tokens_to_ids(self.tokenizers[1].tokenize(response_text)) + tokenized_text1 = self.tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1 = [self.gpt2_bos_token] + tokenized_text1 + [self.gpt2_eos_token] + tokenized_text1_length = len(tokenized_text1) + + # pdb.set_trace() + example = { + 'bert_token_ctx': tokenized_text_ctx, + 
'bert_token_ctx_length':tokenized_text_ctx_length, + 'bert_token': tokenized_text0, + 'bert_token_length':tokenized_text0_length, + 'gpt2_token':tokenized_text1, + 'gpt2_token_length': tokenized_text1_length + } + self.examples.append(example) + count +=1 + + return dropped, count + + + + + + +class TextDataset_Split(Dataset): + def __init__(self, tokenizer, args, file_path='train', text_split_mode='natural', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_{block_size}_{filename}') + + self.examples = [] + self.tokenizer = tokenizer + + # GPT tokenizers + self.pad_token_id=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0] + self.bos_token_id=tokenizer.convert_tokens_to_ids([tokenizer.bos_token])[0] + self.eos_token_id=tokenizer.convert_tokens_to_ids([tokenizer.eos_token])[0] + + if args.dataset == 'Yelp': + label_on = True + else: + label_on = False + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + if text_split_mode == 'block': + self._read_corpus_block_split(fname=file_path, block_size = block_size) + elif text_split_mode == 'natural': + self._read_corpus_natural_split(fname=file_path, label=label_on, max_length=block_size, block_size=block_size) + else: + print('Please specify the mode to split the raw text') + + # pdb.set_trace() + + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. + + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + # pdb.set_trace() + # Convert to Tensors and build dataset + tokenized_text1= torch.tensor(self.examples[item][0], dtype=torch.long) + tokenized_text_lengths = torch.tensor([self.examples[item][1]], dtype=torch.long) + # pdb.set_trace() + return (tokenized_text1, tokenized_text_lengths) + + def _read_corpus_natural_split(self, fname, label, max_length, block_size): + data = [] + labels = [] if label else None + dropped = 0 + + + + with open(fname) as fin: + for line in fin: + + if label: + split_line = line.split('\t') + lb = split_line[0] + split_line_text = split_line[1] + else: + split_line_text = line + + if len(split_line_text) < 1: + dropped += 1 + continue + + if max_length: + if len(split_line_text.split()) > max_length: + dropped += 1 + continue + + if label: + labels.append(lb) + + tokenized_text1 = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(split_line_text)) + tokenized_text1 = self.tokenizer.add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1_length = len(tokenized_text1) + + tokenized_text1 = [self.bos_token_id] + tokenized_text1 + [self.eos_token_id] + tokenized_text1 = tokenized_text1 + ([self.pad_token_id] * (block_size - tokenized_text1_length - 2) ) # Pad up to the sequence length. 
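                    # Worked example of the layout built above, with a hypothetical block_size of 8
                    # and a 3-token sentence: [BOS, t1, t2, t3, EOS] followed by (8 - 3 - 2) = 3 pad
                    # ids, i.e. tokenized_text1_length + 2 special tokens + padding = block_size,
                    # which is exactly what the assert below checks.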
+ assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text1, tokenized_text1_length]) + + + + + def _read_corpus_block_split(self, fname, block_size): + + with open(fname, encoding="utf-8") as f: + text = f.read() + + # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied + while len(text) >= block_size: # Truncate in block of block_size + + tokenized_text1 = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text[:block_size])) + tokenized_text1 = self.tokenizer.add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1_length = len(tokenized_text1) + + tokenized_text1 = [bos_token_id] + tokenized_text1 + [eos_token_id] + tokenized_text1 = tokenized_text1 + ([pad_token_id] * (block_size - tokenized_text1_length - 2) ) # Pad up to the sequence length. + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text1, tokenized_text1_length]) + text = text[block_size:] + + + + + +class TextDataset_2Tokenizers_LCtrlG(Dataset): + def __init__(self, tokenizers, args, file_path='train', text_split_mode='natural', block_size=512, create_new=0): + print(file_path) + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}') + + self.examples = [] + self.tokenizers = tokenizers + + # GPT tokenizers + self.pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + self.bos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].bos_token])[0] + self.eos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].eos_token])[0] + + if not create_new and os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + if text_split_mode == 'natural': + if args.dataset == 'Yelp': + dropped = self._read_corpus_natural_split_yelp(fname=file_path, label=True, max_length=block_size, block_size=block_size) + logger.info("The number of dropped sentences is %d", dropped) + elif args.dataset == 'yahoo': + pass + else: + raise NotImplementedError + else: + raise ValueError('Please specify the mode to split the raw text') + + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + # pdb.set_trace() + # Convert to Tensors and build dataset + tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long) + tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long) + tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long) + label = torch.tensor(self.examples[item][4], dtype=torch.long) + + # pdb.set_trace() + return (tokenized_text0, tokenized_text1, tokenized_text_lengths, label) + + def get_labels(self): + return ['0', '1'] + + def _read_corpus_natural_split_yelp(self, fname, label, max_length, block_size): + # label: the file contains labels. 
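        # Assumed on-disk layout for the Yelp split: the file passed as `fname` holds one sentence
        # per line, and a parallel file with the same name but the suffix '.labels' instead of
        # '.text' holds one binary sentiment label (0 or 1) per line, aligned line-by-line with
        # the sentences.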
+ dropped = 0 + label_fname = fname.replace('.text', '.labels') + + with open(fname) as fin, open(label_fname) as lfin: + for line, label_line in zip(fin, lfin): + # pdb.set_trace() + split_line_text = line + lb = int(label_line) + assert lb in [0, 1] # binary sentiment in yelp dataset. + + if len(split_line_text) < 1: + dropped += 1 + continue + + if max_length: + if len(split_line_text.split()) > max_length: + dropped += 1 + continue + + # tokenize by tokenizers[0] + tokenized_text0 = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(split_line_text)) + tokenized_text0 = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + pad_token=self.tokenizers[0].convert_tokens_to_ids([self.tokenizers[0].pad_token])[0] + # pad to max_seq_length (block_size) + if block_size > tokenized_text0_length: + tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length. + else: + dropped += 1 + continue + assert len(tokenized_text0) == block_size + + # tokenize by tokenizers[1] + tokenized_text1 = self.tokenizers[1].convert_tokens_to_ids(self.tokenizers[1].tokenize(split_line_text)) + tokenized_text1 = self.tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1 = [self.bos_token] + tokenized_text1 + [self.eos_token] + tokenized_text1_length = len(tokenized_text1) + # pad to max_seq_length (block_size) + if block_size > tokenized_text1_length: + tokenized_text1 = tokenized_text1 + ([self.pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length. + else: + dropped += 1 + continue + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length, lb]) + + return dropped + + +class TextDataset_2Tokenizers(Dataset): + def __init__(self, tokenizers, args, file_path='train', text_split_mode='natural', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_gpt_bert_{block_size}_{filename}') + + self.examples = [] + self.tokenizers = tokenizers + + # GPT tokenizers + self.pad_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].pad_token])[0] + self.bos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].bos_token])[0] + self.eos_token=tokenizers[1].convert_tokens_to_ids([tokenizers[1].eos_token])[0] + + if args.dataset == 'Yelp': + label_on = True + else: + label_on = False + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + if text_split_mode == 'block': + self._read_corpus_block_split(fname=file_path, block_size = block_size) + elif text_split_mode == 'natural': + dropped, count = self._read_corpus_natural_split(fname=file_path, label=label_on, max_length=block_size, block_size=block_size, args=args) + logger.info("The number of dropped sentences is %d", dropped) + logger.info("The number of used sentences is %d", count) + else: + print('Please specify the mode to split the raw text') + + # pdb.set_trace() + + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) 
and second you + # can change this behavior by adding (model specific) padding. + + logger.info("Saving features into cached file %s", cached_features_file) + if args.use_philly: + save_solid = False + while not save_solid: + try: + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + except: + pass + else: + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + # pdb.set_trace() + # Convert to Tensors and build dataset + tokenized_text0= torch.tensor(self.examples[item][0], dtype=torch.long) + tokenized_text1= torch.tensor(self.examples[item][2], dtype=torch.long) + tokenized_text_lengths = torch.tensor([self.examples[item][1], self.examples[item][3]], dtype=torch.long) + + # pdb.set_trace() + return (tokenized_text0, tokenized_text1, tokenized_text_lengths) + + def _read_corpus_natural_split(self, fname, label, max_length, block_size, args): + data = [] + labels = [] if label else None + dropped = 0 + count = 0 + + with open(fname) as fin: + for line in fin: + # pdb.set_trace() + + if label: + split_line = line.split('\t') + lb = split_line[0] + split_line_text = split_line[1] + else: + split_line_text = line + + if len(split_line_text.split()) < 1: + dropped += 1 + continue + + if max_length: + if len(split_line_text.split()) > max_length: + dropped += 1 + continue + + if label: + labels.append(lb) + + tokenized_text0 = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(split_line_text)) + tokenized_text0 = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + pad_token=self.tokenizers[0].convert_tokens_to_ids([self.tokenizers[0].pad_token])[0] + if block_size>tokenized_text0_length: + tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length. + else: + dropped += 1 + continue + + assert len(tokenized_text0) == block_size + + tokenized_text1 = self.tokenizers[1].convert_tokens_to_ids(self.tokenizers[1].tokenize(split_line_text)) + tokenized_text1 = self.tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1 = [self.bos_token] + tokenized_text1 + [self.eos_token] + tokenized_text1_length = len(tokenized_text1) + + if block_size>tokenized_text1_length: + tokenized_text1 = tokenized_text1 + ([self.pad_token] * (block_size - tokenized_text1_length) ) # Pad up to the sequence length. 
+ else: + dropped += 1 + continue + + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length]) + + count +=1 + # if args.dataset == 'wikipedia' and count==10: + # break + + return dropped, count + + def _read_corpus_block_split(self, fname, block_size): + + with open(fname, encoding="utf-8") as f: + text = f.read() + + # Chunyuan: divide the linguistic text into the same length, then different tokenization schemes are applied + while len(text) >= block_size: # Truncate in block of block_size + + tokenized_text0 = self.tokenizers[0].convert_tokens_to_ids(self.tokenizers[0].tokenize(text[:block_size])) + tokenized_text0 = self.tokenizers[0].add_special_tokens_single_sentence(tokenized_text0) + tokenized_text0_length = len(tokenized_text0) + pad_token=self.tokenizers[0].convert_tokens_to_ids([self.tokenizers[0].pad_token])[0] + tokenized_text0 = tokenized_text0 + ([pad_token] * (block_size - tokenized_text0_length) ) # Pad up to the sequence length. + assert len(tokenized_text0) == block_size + + tokenized_text1 = self.tokenizers[1].convert_tokens_to_ids(self.tokenizers[1].tokenize(text[:block_size])) + tokenized_text1 = self.tokenizers[1].add_special_tokens_single_sentence(tokenized_text1) + tokenized_text1_length = len(tokenized_text1) + + + tokenized_text1 = [bos_token] + tokenized_text1 + [eos_token] + tokenized_text1 = tokenized_text1 + ([pad_token] * (block_size - tokenized_text1_length - 2) ) # Pad up to the sequence length. + assert len(tokenized_text1) == block_size + + self.examples.append([tokenized_text0, tokenized_text0_length, tokenized_text1, tokenized_text1_length]) + text = text[block_size:] + + +def frange_cycle_linear(n_iter, start=0.0, stop=1.0, n_cycle=4, ratio=0.5): + L = np.ones(n_iter) * stop + period = n_iter/n_cycle + step = (stop-start)/(period*ratio) # linear schedule + + for c in range(n_cycle): + v, i = start, 0 + while v <= stop and (int(i+c*period) < n_iter): + L[int(i+c*period)] = v + v += step + i += 1 + return L + +def frange_cycle_zero_linear(n_iter, start=0.0, stop=1.0, n_cycle=4, ratio_increase=0.5, ratio_zero=0.3): + L = np.ones(n_iter) * stop + period = n_iter/n_cycle + step = (stop-start)/(period*ratio_increase) # linear schedule + + for c in range(n_cycle): + v, i = start, 0 + while v <= stop and (int(i+c*period) < n_iter): + if i < period*ratio_zero: + L[int(i+c*period)] = start + else: + L[int(i+c*period)] = v + v += step + i += 1 + return L + + +class uniform_initializer(object): + def __init__(self, stdv): + self.stdv = stdv + def __call__(self, tensor): + nn.init.uniform_(tensor, -self.stdv, self.stdv) + + +class xavier_normal_initializer(object): + def __call__(self, tensor): + nn.init.xavier_normal_(tensor) + +def reconstruct(model, test_data_batch, vocab, strategy, fname): + hyps = [] + refs = [] + with open(fname, "w") as fout: + #for i in range(10): + # batch_data = test_data_batch[i] + + for batch_data in test_data_batch: + decoded_batch = model.reconstruct(batch_data, strategy) + + source = [[vocab.id2word(id_.item()) for id_ in sent] for sent in batch_data] + for j in range(len(batch_data)): + ref = " ".join(source[j]) + hyp = " ".join(decoded_batch[j]) + fout.write("SOURCE: {}\n".format(ref)) + fout.write("RECON: {}\n\n".format(hyp)) + + refs += [ref[len(""): -len("")]] + if strategy == "beam": + hyps += [hyp[len(""): -len("")]] + else: + hyps += [hyp[: -len("")]] + + fname_ref = fname + ".ref" + fname_hyp = fname + ".hyp" + with open(fname_ref, 
"w") as f: + f.write("\n".join(refs)) + with open(fname_hyp, "w") as f: + f.write("\n".join(hyps)) + call_multi_bleu_perl("scripts/multi-bleu.perl", fname_hyp, fname_ref, verbose=True) + + + + +def calc_iwnll(model_vae, eval_dataloader, args, ns=20): + + eval_loss = 0.0 + ############ Perplexity ############ + report_kl_loss = report_rec_loss = report_loss = 0 + report_num_words = report_num_sents = 0 + + for batch in tqdm(eval_dataloader, desc="Evaluating PPL"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x1 = x1[:,:max_len_values[1]] + + x0 = x0.to(args.device) + x1 = x1.to(args.device) + x_lengths = x_lengths.to(args.device) + + # pdb.set_trace() + # not predict start symbol + report_num_words += x_lengths[:,1].sum().item() + report_num_sents += args.eval_batch_size + + with torch.no_grad(): + loss, loss_rc, loss_kl = model_vae.loss_iw(x0, x1, nsamples=100, ns=5) + + loss_rc = loss_rc.sum() + loss_kl = loss_kl.sum() + loss = loss.sum() + + report_rec_loss += loss_rc.item() + report_kl_loss += loss_kl.item() + report_loss += loss.item() + + # pdb.set_trace() + + test_loss = report_loss / report_num_sents + + elbo = (report_kl_loss - report_rec_loss) / report_num_sents + nll = - report_rec_loss / report_num_sents + kl = report_kl_loss / report_num_sents + ppl = np.exp(-report_loss / report_num_words) + + return ppl, elbo, nll, kl + + + +def calc_rec(model_vae, eval_dataloader, args, ns=1): + + eval_loss = 0.0 + ############ Perplexity ############ + report_kl_loss = report_rec_loss = report_loss = 0 + report_num_words = report_num_sents = 0 + + i = 0 + for batch in tqdm(eval_dataloader, desc="Evaluating PPL"): + # pdb.set_trace() + x0, x1, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x1 = x1[:,:max_len_values[1]] + + x0 = x0.to(args.device) + x1 = x1.to(args.device) + x_lengths = x_lengths.to(args.device) + + # pdb.set_trace() + # not predict start symbol + report_num_words += x_lengths[:,1].sum().item() + report_num_sents += args.eval_batch_size + + with torch.no_grad(): + loss, loss_rc, loss_kl = model_vae.loss_iw(x0, x1, nsamples=1, ns=1) + + loss_rc = loss_rc.sum() + report_rec_loss += loss_rc.item() + + i += 1 + if i > 500: + break + + + # pdb.set_trace() + + nll_s = - report_rec_loss / report_num_sents + nll_w = - report_rec_loss / report_num_words + + return nll_s, nll_w + + + +# def calc_mi(model, test_data_batch): +# mi = 0 +# num_examples = 0 +# for batch_data in test_data_batch: +# batch_size = batch_data.size(0) +# num_examples += batch_size +# mutual_info = model.calc_mi_q(batch_data) +# mi += mutual_info * batch_size + +# return mi / num_examples + + + +def calc_mi(model_vae, test_data_batch, args): + # calc_mi_v3 + import math + from modules.utils import log_sum_exp + + mi = 0 + num_examples = 0 + + mu_batch_list, logvar_batch_list = [], [] + neg_entropy = 0. 
+ for batch in tqdm(test_data_batch, desc="Evaluating MI, Stage 1"): + + x0, _, x_lengths = batch + + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + + x0 = x0.to(args.device) + + with torch.no_grad(): + # encoding into bert features + bert_fea = model_vae.encoder(x0)[1] + + # (batch_size, nz) + mu, logvar = model_vae.encoder.linear(bert_fea).chunk(2, -1) + + x_batch, nz = mu.size() + + #print(x_batch, end=' ') + + num_examples += x_batch + + # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1) + + neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item() + mu_batch_list += [mu.cpu()] + logvar_batch_list += [logvar.cpu()] + + + neg_entropy = neg_entropy / num_examples + ##print() + + num_examples = 0 + log_qz = 0. + for i in tqdm(range(len(mu_batch_list)), desc="Evaluating MI, Stage 2"): + + ############### + # get z_samples + ############### + mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + + # [z_batch, 1, nz] + with torch.no_grad(): + z_samples = model_vae.reparameterize(mu, logvar, 1) + + z_samples = z_samples.view(-1, 1, nz) + num_examples += z_samples.size(0) + + ############### + # compute density + ############### + # [1, x_batch, nz] + #mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + #indices = list(np.random.choice(np.arange(len(mu_batch_list)), 10)) + [i] + indices = np.arange(len(mu_batch_list)) + mu = torch.cat([mu_batch_list[_] for _ in indices], dim=0).cuda() + logvar = torch.cat([logvar_batch_list[_] for _ in indices], dim=0).cuda() + x_batch, nz = mu.size() + + mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0) + var = logvar.exp() + + # (z_batch, x_batch, nz) + dev = z_samples - mu + + # (z_batch, x_batch) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + # log q(z): aggregate posterior + # [z_batch] + log_qz += (log_sum_exp(log_density, dim=1) - math.log(x_batch)).sum(-1) + + log_qz /= num_examples + mi = neg_entropy - log_qz + + return mi.item() + + + + + +def calc_au(model_vae, eval_dataloader, args, delta=0.01): + """compute the number of active units + """ + cnt = 0 + for batch in tqdm(eval_dataloader, desc="Evaluating AU, Stage 1"): + + x0, _, x_lengths = batch + max_len_values, _ = x_lengths.max(0) + x0 = x0[:,:max_len_values[0]] + x0 = x0.to(args.device) + + with torch.no_grad(): + # encoding into bert features + bert_fea = model_vae.encoder(x0)[1] + + # (batch_size, nz) + mean, logvar = model_vae.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + means_sum = mean.sum(dim=0, keepdim=True) + else: + means_sum = means_sum + mean.sum(dim=0, keepdim=True) + cnt += mean.size(0) + + # (1, nz) + mean_mean = means_sum / cnt + + cnt = 0 + for batch in tqdm(eval_dataloader, desc="Evaluating AU, Stage 2"): + + x0, _, _ = batch + x0 = x0.to(args.device) + + with torch.no_grad(): + # encoding into bert features + bert_fea = model_vae.encoder(x0)[1] + + # (batch_size, nz) + mean, _ = model_vae.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + var_sum = ((mean - mean_mean) ** 2).sum(dim=0) + else: + var_sum = var_sum + ((mean - mean_mean) ** 2).sum(dim=0) + cnt += mean.size(0) + + # (nz) + au_var = var_sum / (cnt - 1) + + # pdb.set_trace() + return (au_var >= delta).sum().item(), au_var + + +def sample_sentences(vae, vocab, device, num_sentences): + global logging + + vae.eval() + sampled_sents = [] + for i in range(num_sentences): + z = vae.sample_from_prior(1) + z = z.view(1,1,-1) + start 
= vocab.word2id[''] + # START = torch.tensor([[[start]]]) + START = torch.tensor([[start]]) + end = vocab.word2id[''] + START = START.to(device) + z = z.to(device) + vae.eval() + sentence = vae.decoder.sample_text(START, z, end, device) + decoded_sentence = vocab.decode_sentence(sentence) + sampled_sents.append(decoded_sentence) + for i, sent in enumerate(sampled_sents): + logging(i,":",' '.join(sent)) + +# def visualize_latent(args, vae, device, test_data): +# f = open('yelp_embeddings_z','w') +# g = open('yelp_embeddings_labels','w') + +# test_data_batch, test_label_batch = test_data.create_data_batch_labels(batch_size=args.batch_size, device=device, batch_first=True) +# for i in range(len(test_data_batch)): +# batch_data = test_data_batch[i] +# batch_label = test_label_batch[i] +# batch_size, sent_len = batch_data.size() +# means, _ = vae.encoder.forward(batch_data) +# for i in range(batch_size): +# mean = means[i,:].cpu().detach().numpy().tolist() +# for val in mean: +# f.write(str(val)+'\t') +# f.write('\n') +# for label in batch_label: +# g.write(label+'\n') +# fo +# print(mean.size()) +# print(logvar.size()) +# fooo + +def visualize_latent(args, epoch, vae, device, test_data): + nsamples = 1 + + with open(os.path.join(args.exp_dir, f'synthetic_latent_{epoch}.txt'),'w') as f: + test_data_batch, test_label_batch = test_data.create_data_batch_labels(batch_size=args.batch_size, device=device, batch_first=True) + for i in range(len(test_data_batch)): + batch_data = test_data_batch[i] + batch_label = test_label_batch[i] + batch_size, sent_len = batch_data.size() + samples, _ = vae.encoder.encode(batch_data, nsamples) + for i in range(batch_size): + for j in range(nsamples): + sample = samples[i,j,:].cpu().detach().numpy().tolist() + f.write(batch_label[i] + '\t' + ' '.join([str(val) for val in sample]) + '\n') + + +def call_multi_bleu_perl(fname_bleu_script, fname_hyp, fname_ref, verbose=True): + cmd = "perl %s %s < %s" % (fname_bleu_script, fname_ref, fname_hyp) + popen = subprocess.Popen(cmd, stdout=subprocess.PIPE, \ + stderr=subprocess.PIPE, shell=True) + popen.wait() + try: + bleu_result = popen.stdout.readline().strip().decode("utf-8") + if verbose: + print(bleu_result) + bleu = float(bleu_result[7:bleu_result.index(',')]) + stderrs = popen.stderr.readlines() + if len(stderrs) > 1: + for line in stderrs: + print(line.strip()) + except Exception as e: + print(e) + bleu = 0. 
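    # Parsing note: the command above feeds the reference file as an argument and pipes the
    # hypotheses on stdin, and the slicing assumes the standard multi-bleu.perl output, which
    # starts with "BLEU = <score>, ..."; the score is read as the text between "BLEU = "
    # (7 characters) and the first comma. If the script fails, the exception is printed and a
    # BLEU of 0.0 is returned.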
+ return bleu + + + + +def weight_init(m): + ''' + Usage: + model = Model() + model.apply(weight_init) + ''' + if isinstance(m, nn.Conv1d): + init.normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.Conv2d): + init.xavier_normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.Conv3d): + init.xavier_normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.ConvTranspose1d): + init.normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.ConvTranspose2d): + init.xavier_normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.ConvTranspose3d): + init.xavier_normal_(m.weight.data) + if m.bias is not None: + init.normal_(m.bias.data) + elif isinstance(m, nn.BatchNorm1d): + init.normal_(m.weight.data, mean=1, std=0.02) + init.constant_(m.bias.data, 0) + elif isinstance(m, nn.BatchNorm2d): + init.normal_(m.weight.data, mean=1, std=0.02) + init.constant_(m.bias.data, 0) + elif isinstance(m, nn.BatchNorm3d): + init.normal_(m.weight.data, mean=1, std=0.02) + init.constant_(m.bias.data, 0) + elif isinstance(m, nn.Linear): + init.xavier_normal_(m.weight.data) + init.normal_(m.bias.data) + elif isinstance(m, nn.LSTM): + for param in m.parameters(): + if len(param.shape) >= 2: + init.orthogonal_(param.data) + else: + init.normal_(param.data) + elif isinstance(m, nn.LSTMCell): + for param in m.parameters(): + if len(param.shape) >= 2: + init.orthogonal_(param.data) + else: + init.normal_(param.data) + elif isinstance(m, nn.GRU): + for param in m.parameters(): + if len(param.shape) >= 2: + init.orthogonal_(param.data) + else: + init.normal_(param.data) + elif isinstance(m, nn.GRUCell): + for param in m.parameters(): + if len(param.shape) >= 2: + init.orthogonal_(param.data) + else: + init.normal_(param.data) + + +if __name__ == '__main__': + pass \ No newline at end of file diff --git a/Optimus/code/examples/contrib/README.md b/Optimus/code/examples/contrib/README.md new file mode 100755 index 0000000000000000000000000000000000000000..f2d0616e629bcc7d7800d1a4b727e725379ac736 --- /dev/null +++ b/Optimus/code/examples/contrib/README.md @@ -0,0 +1,5 @@ +# Community contributed examples + +This folder contains examples which are not actively maintained (mostly contributed by the community). + +Using these examples together with a recent version of the library usually requires to make small (sometimes big) adaptations to get the scripts working. diff --git a/Optimus/code/examples/contrib/run_openai_gpt.py b/Optimus/code/examples/contrib/run_openai_gpt.py new file mode 100755 index 0000000000000000000000000000000000000000..1c9fba8ee8367b4ae514cb7c60d7b9c99004618b --- /dev/null +++ b/Optimus/code/examples/contrib/run_openai_gpt.py @@ -0,0 +1,290 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" OpenAI GPT model fine-tuning script. + Adapted from https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py + It self adapted from https://github.com/openai/finetune-transformer-lm/blob/master/train.py + + This script with default values fine-tunes and evaluate a pretrained OpenAI GPT on the RocStories dataset: + python run_openai_gpt.py \ + --model_name openai-gpt \ + --do_train \ + --do_eval \ + --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \ + --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \ + --output_dir ../log \ + --train_batch_size 16 \ +""" +import argparse +import os +import csv +import random +import logging +from tqdm import tqdm, trange + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) + +from pytorch_transformers import (OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer, + AdamW, cached_path, WEIGHTS_NAME, CONFIG_NAME, + WarmupLinearSchedule) + +ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz" + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +def accuracy(out, labels): + outputs = np.argmax(out, axis=1) + return np.sum(outputs == labels) + +def load_rocstories_dataset(dataset_path): + """ Output a list of tuples(story, 1st continuation, 2nd continuation, label) """ + with open(dataset_path, encoding='utf_8') as f: + f = csv.reader(f) + output = [] + next(f) # skip the first line + for line in tqdm(f): + output.append((' '.join(line[1:5]), line[5], line[6], int(line[-1])-1)) + return output + +def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, delimiter_token, clf_token): + """ Pre-process datasets containing lists of tuples(story, 1st continuation, 2nd continuation, label) + + To Transformer inputs of shape (n_batch, n_alternative, length) comprising for each batch, continuation: + input_ids[batch, alternative, :] = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token] + """ + tensor_datasets = [] + for dataset in encoded_datasets: + n_batch = len(dataset) + input_ids = np.zeros((n_batch, 2, input_len), dtype=np.int64) + mc_token_ids = np.zeros((n_batch, 2), dtype=np.int64) + lm_labels = np.full((n_batch, 2, input_len), fill_value=-1, dtype=np.int64) + mc_labels = np.zeros((n_batch,), dtype=np.int64) + for i, (story, cont1, cont2, mc_label), in enumerate(dataset): + with_cont1 = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token] + with_cont2 = [start_token] + story[:cap_length] + [delimiter_token] + cont2[:cap_length] + [clf_token] + input_ids[i, 0, :len(with_cont1)] = with_cont1 + input_ids[i, 1, :len(with_cont2)] = with_cont2 + mc_token_ids[i, 0] = len(with_cont1) - 1 + mc_token_ids[i, 1] = len(with_cont2) - 1 + lm_labels[i, 0, :len(with_cont1)] = with_cont1 + lm_labels[i, 1, :len(with_cont2)] = with_cont2 + mc_labels[i] = mc_label + all_inputs = (input_ids, mc_token_ids, lm_labels, mc_labels) + tensor_datasets.append(tuple(torch.tensor(t) for t in all_inputs)) + return tensor_datasets + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--model_name', type=str, default='openai-gpt', + 
help='pretrained model name') + parser.add_argument("--do_train", action='store_true', help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', help="Whether to run eval on the dev set.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + parser.add_argument('--train_dataset', type=str, default='') + parser.add_argument('--eval_dataset', type=str, default='') + parser.add_argument('--seed', type=int, default=42) + parser.add_argument('--num_train_epochs', type=int, default=3) + parser.add_argument('--train_batch_size', type=int, default=8) + parser.add_argument('--eval_batch_size', type=int, default=16) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument('--max_grad_norm', type=int, default=1) + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training \ + steps to perform. Override num_train_epochs.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before\ + performing a backward/update pass.") + parser.add_argument('--learning_rate', type=float, default=6.25e-5) + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument('--lr_schedule', type=str, default='warmup_linear') + parser.add_argument('--weight_decay', type=float, default=0.01) + parser.add_argument('--lm_coef', type=float, default=0.9) + parser.add_argument('--n_valid', type=int, default=374) + + parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") + args = parser.parse_args() + print(args) + + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + n_gpu = torch.cuda.device_count() + logger.info("device: {}, n_gpu {}".format(device, n_gpu)) + + if not args.do_train and not args.do_eval: + raise ValueError("At least one of `do_train` or `do_eval` must be True.") + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + # Load tokenizer and model + # This loading functions also add new tokens and embeddings called `special tokens` + # These new embeddings will be fine-tuned on the RocStories dataset + special_tokens = ['_start_', '_delimiter_', '_classify_'] + tokenizer = OpenAIGPTTokenizer.from_pretrained(args.model_name) + tokenizer.add_tokens(special_tokens) + special_tokens_ids = tokenizer.convert_tokens_to_ids(special_tokens) + model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.model_name) + model.resize_token_embeddings(len(tokenizer)) + model.to(device) + + # Load and encode the datasets + if not args.train_dataset and not args.eval_dataset: + roc_stories = cached_path(ROCSTORIES_URL) + def tokenize_and_encode(obj): + """ Tokenize and encode a nested object """ + if isinstance(obj, str): + return 
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj)) + elif isinstance(obj, int): + return obj + return list(tokenize_and_encode(o) for o in obj) + logger.info("Encoding dataset...") + train_dataset = load_rocstories_dataset(args.train_dataset) + eval_dataset = load_rocstories_dataset(args.eval_dataset) + datasets = (train_dataset, eval_dataset) + encoded_datasets = tokenize_and_encode(datasets) + + # Compute the max input length for the Transformer + max_length = model.config.n_positions // 2 - 2 + input_length = max(len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3 \ + for dataset in encoded_datasets for story, cont1, cont2, _ in dataset) + input_length = min(input_length, model.config.n_positions) # Max size of input for the pre-trained model + + # Prepare inputs tensors and dataloaders + tensor_datasets = pre_process_datasets(encoded_datasets, input_length, max_length, *special_tokens_ids) + train_tensor_dataset, eval_tensor_dataset = tensor_datasets[0], tensor_datasets[1] + + train_data = TensorDataset(*train_tensor_dataset) + train_sampler = RandomSampler(train_data) + train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size) + + eval_data = TensorDataset(*eval_tensor_dataset) + eval_sampler = SequentialSampler(eval_data) + eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Prepare optimizer + if args.do_train: + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps //\ + (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader)\ + // args.gradient_accumulation_steps * args.num_train_epochs + + param_optimizer = list(model.named_parameters()) + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + + if args.do_train: + nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None + model.train() + for _ in trange(int(args.num_train_epochs), desc="Epoch"): + tr_loss = 0 + nb_tr_steps = 0 + tqdm_bar = tqdm(train_dataloader, desc="Training") + for step, batch in enumerate(tqdm_bar): + batch = tuple(t.to(device) for t in batch) + input_ids, mc_token_ids, lm_labels, mc_labels = batch + losses = model(input_ids, mc_token_ids=mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels) + loss = args.lm_coef * losses[0] + losses[1] + loss.backward() + scheduler.step() + optimizer.step() + optimizer.zero_grad() + tr_loss += loss.item() + exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item() + nb_tr_steps += 1 + tqdm_bar.desc = "Training loss: {:.2e} lr: {:.2e}".format(exp_average_loss, scheduler.get_lr()[0]) + + # Save a trained model + if args.do_train: + # Save a trained model, configuration and tokenizer + model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self + + # If we save using the predefined names, we can load using `from_pretrained` + output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME) + output_config_file = os.path.join(args.output_dir, 
CONFIG_NAME) + + torch.save(model_to_save.state_dict(), output_model_file) + model_to_save.config.to_json_file(output_config_file) + tokenizer.save_vocabulary(args.output_dir) + + # Load a trained model and vocabulary that you have fine-tuned + model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.output_dir) + tokenizer = OpenAIGPTTokenizer.from_pretrained(args.output_dir) + model.to(device) + + if args.do_eval: + model.eval() + eval_loss, eval_accuracy = 0, 0 + nb_eval_steps, nb_eval_examples = 0, 0 + for batch in tqdm(eval_dataloader, desc="Evaluating"): + batch = tuple(t.to(device) for t in batch) + input_ids, mc_token_ids, lm_labels, mc_labels = batch + with torch.no_grad(): + _, mc_loss, _, mc_logits = model(input_ids, mc_token_ids=mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels) + + mc_logits = mc_logits.detach().cpu().numpy() + mc_labels = mc_labels.to('cpu').numpy() + tmp_eval_accuracy = accuracy(mc_logits, mc_labels) + + eval_loss += mc_loss.mean().item() + eval_accuracy += tmp_eval_accuracy + + nb_eval_examples += input_ids.size(0) + nb_eval_steps += 1 + + eval_loss = eval_loss / nb_eval_steps + eval_accuracy = eval_accuracy / nb_eval_examples + train_loss = tr_loss/nb_tr_steps if args.do_train else None + result = {'eval_loss': eval_loss, + 'eval_accuracy': eval_accuracy, + 'train_loss': train_loss} + + output_eval_file = os.path.join(args.output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results *****") + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/contrib/run_swag.py b/Optimus/code/examples/contrib/run_swag.py new file mode 100755 index 0000000000000000000000000000000000000000..495f40cec96331097104e7aff48e88d50dee05d4 --- /dev/null +++ b/Optimus/code/examples/contrib/run_swag.py @@ -0,0 +1,673 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT finetuning runner. + Finetuning the library models for multiple choice on SWAG (Bert). 
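+
+    A hypothetical example invocation (the paths and $SWAG_DIR are placeholders;
+    every flag below is defined in main() of this script):
+    python run_swag.py \
+        --model_type bert \
+        --model_name_or_path bert-base-uncased \
+        --do_train \
+        --do_eval \
+        --train_file $SWAG_DIR/train.csv \
+        --predict_file $SWAG_DIR/val.csv \
+        --max_seq_length 80 \
+        --per_gpu_train_batch_size 8 \
+        --learning_rate 5e-5 \
+        --num_train_epochs 3 \
+        --output_dir ../log/swag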
+""" +from __future__ import absolute_import, division, print_function + +import argparse +import logging +import csv +import os +import random +import sys +import glob + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + +from tensorboardX import SummaryWriter + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForMultipleChoice, BertTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \ + for conf in [BertConfig]), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForMultipleChoice, BertTokenizer), +} + +class SwagExample(object): + """A single training/test example for the SWAG dataset.""" + def __init__(self, + swag_id, + context_sentence, + start_ending, + ending_0, + ending_1, + ending_2, + ending_3, + label = None): + self.swag_id = swag_id + self.context_sentence = context_sentence + self.start_ending = start_ending + self.endings = [ + ending_0, + ending_1, + ending_2, + ending_3, + ] + self.label = label + + def __str__(self): + return self.__repr__() + + def __repr__(self): + l = [ + "swag_id: {}".format(self.swag_id), + "context_sentence: {}".format(self.context_sentence), + "start_ending: {}".format(self.start_ending), + "ending_0: {}".format(self.endings[0]), + "ending_1: {}".format(self.endings[1]), + "ending_2: {}".format(self.endings[2]), + "ending_3: {}".format(self.endings[3]), + ] + + if self.label is not None: + l.append("label: {}".format(self.label)) + + return ", ".join(l) + +class InputFeatures(object): + def __init__(self, + example_id, + choices_features, + label + + ): + self.example_id = example_id + self.choices_features = [ + { + 'input_ids': input_ids, + 'input_mask': input_mask, + 'segment_ids': segment_ids + } + for _, input_ids, input_mask, segment_ids in choices_features + ] + self.label = label + +def read_swag_examples(input_file, is_training=True): + with open(input_file, 'r', encoding='utf-8') as f: + reader = csv.reader(f) + lines = [] + for line in reader: + if sys.version_info[0] == 2: + line = list(unicode(cell, 'utf-8') for cell in line) + lines.append(line) + + if is_training and lines[0][-1] != 'label': + raise ValueError( + "For training, the input file must contain a label column." + ) + + examples = [ + SwagExample( + swag_id = line[2], + context_sentence = line[4], + start_ending = line[5], # in the swag dataset, the + # common beginning of each + # choice is stored in "sent2". + ending_0 = line[7], + ending_1 = line[8], + ending_2 = line[9], + ending_3 = line[10], + label = int(line[11]) if is_training else None + ) for line in lines[1:] # we skip the line with the column names + ] + + return examples + +def convert_examples_to_features(examples, tokenizer, max_seq_length, + is_training): + """Loads a data file into a list of `InputBatch`s.""" + + # Swag is a multiple choice task. To perform this task using Bert, + # we will use the formatting proposed in "Improving Language + # Understanding by Generative Pre-Training" and suggested by + # @jacobdevlin-google in this issue + # https://github.com/google-research/bert/issues/38. + # + # Each choice will correspond to a sample on which we run the + # inference. 
For a given Swag example, we will create the 4 + # following inputs: + # - [CLS] context [SEP] choice_1 [SEP] + # - [CLS] context [SEP] choice_2 [SEP] + # - [CLS] context [SEP] choice_3 [SEP] + # - [CLS] context [SEP] choice_4 [SEP] + # The model will output a single value for each input. To get the + # final decision of the model, we will run a softmax over these 4 + # outputs. + features = [] + for example_index, example in tqdm(enumerate(examples)): + context_tokens = tokenizer.tokenize(example.context_sentence) + start_ending_tokens = tokenizer.tokenize(example.start_ending) + + choices_features = [] + for ending_index, ending in enumerate(example.endings): + # We create a copy of the context tokens in order to be + # able to shrink it according to ending_tokens + context_tokens_choice = context_tokens[:] + ending_tokens = start_ending_tokens + tokenizer.tokenize(ending) + # Modifies `context_tokens_choice` and `ending_tokens` in + # place so that the total length is less than the + # specified length. Account for [CLS], [SEP], [SEP] with + # "- 3" + _truncate_seq_pair(context_tokens_choice, ending_tokens, max_seq_length - 3) + + tokens = ["[CLS]"] + context_tokens_choice + ["[SEP]"] + ending_tokens + ["[SEP]"] + segment_ids = [0] * (len(context_tokens_choice) + 2) + [1] * (len(ending_tokens) + 1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + padding = [0] * (max_seq_length - len(input_ids)) + input_ids += padding + input_mask += padding + segment_ids += padding + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + choices_features.append((tokens, input_ids, input_mask, segment_ids)) + + label = example.label + if example_index < 5: + logger.info("*** Example ***") + logger.info("swag_id: {}".format(example.swag_id)) + for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features): + logger.info("choice: {}".format(choice_idx)) + logger.info("tokens: {}".format(' '.join(tokens))) + logger.info("input_ids: {}".format(' '.join(map(str, input_ids)))) + logger.info("input_mask: {}".format(' '.join(map(str, input_mask)))) + logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids)))) + if is_training: + logger.info("label: {}".format(label)) + + features.append( + InputFeatures( + example_id = example.swag_id, + choices_features = choices_features, + label = label + ) + ) + + return features + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
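+    # Illustration with made-up lengths: for max_length = 8, a 7-token tokens_a
+    # and a 4-token tokens_b, the loop below pops three tokens from tokens_a only,
+    # leaving 4 + 4 = 8 tokens in total; the shorter sequence is left untouched.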
+ while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + +def accuracy(out, labels): + outputs = np.argmax(out, axis=1) + return np.sum(outputs == labels) + +def select_field(features, field): + return [ + [ + choice[field] + for choice in feature.choices_features + ] + for feature in features + ] + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + +def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False): + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Load data features from cache or dataset file + input_file = args.predict_file if evaluate else args.train_file + cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format( + 'dev' if evaluate else 'train', + list(filter(None, args.model_name_or_path.split('/'))).pop(), + str(args.max_seq_length))) + if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples: + logger.info("Loading features from cached file %s", cached_features_file) + features = torch.load(cached_features_file) + else: + logger.info("Creating features from dataset file at %s", input_file) + examples = read_swag_examples(input_file) + features = convert_examples_to_features( + examples, tokenizer, args.max_seq_length, not evaluate) + + if args.local_rank in [-1, 0]: + logger.info("Saving features into cached file %s", cached_features_file) + torch.save(features, cached_features_file) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Convert to Tensors and build dataset + all_input_ids = torch.tensor(select_field(features, 'input_ids'), dtype=torch.long) + all_input_mask = torch.tensor(select_field(features, 'input_mask'), dtype=torch.long) + all_segment_ids = torch.tensor(select_field(features, 'segment_ids'), dtype=torch.long) + all_label = torch.tensor([f.label for f in features], dtype=torch.long) + + if evaluate: + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, + all_label) + else: + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, + all_label) + + if output_examples: + return dataset, examples, features + return dataset +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in 
no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproductibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + batch = tuple(t.to(args.device) for t in batch) + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + #'token_type_ids': None if args.model_type == 'xlm' else batch[2], + 'token_type_ids': batch[2], + 'labels': batch[3]} + # if args.model_type in ['xlnet', 'xlm']: + # inputs.update({'cls_index': batch[5], + # 'p_mask': batch[6]}) + outputs = model(**inputs) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + optimizer.step() + scheduler.step() # Update learning rate schedule + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + 
tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + tokenizer.save_vocabulary(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + +def evaluate(args, model, tokenizer, prefix=""): + dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True) + + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset) + eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + + + eval_loss, eval_accuracy = 0, 0 + nb_eval_steps, nb_eval_examples = 0, 0 + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + model.eval() + batch = tuple(t.to(args.device) for t in batch) + with torch.no_grad(): + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + # 'token_type_ids': None if args.model_type == 'xlm' else batch[2] # XLM don't use segment_ids + 'token_type_ids': batch[2], + 'labels': batch[3]} + + # if args.model_type in ['xlnet', 'xlm']: + # inputs.update({'cls_index': batch[4], + # 'p_mask': batch[5]}) + outputs = model(**inputs) + tmp_eval_loss, logits = outputs[:2] + eval_loss += tmp_eval_loss.mean().item() + + logits = logits.detach().cpu().numpy() + label_ids = inputs['labels'].to('cpu').numpy() + tmp_eval_accuracy = accuracy(logits, label_ids) + eval_accuracy += tmp_eval_accuracy + + nb_eval_steps += 1 + nb_eval_examples += inputs['input_ids'].size(0) + + eval_loss = eval_loss / nb_eval_steps + eval_accuracy = eval_accuracy / nb_eval_examples + result = {'eval_loss': eval_loss, + 'eval_accuracy': eval_accuracy} + + output_eval_file = os.path.join(args.output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results *****") + for key in sorted(result.keys()): + logger.info("%s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_file", default=None, type=str, required=True, + help="SWAG csv for training. 
E.g., train.csv") + parser.add_argument("--predict_file", default=None, type=str, required=True, + help="SWAG csv for predictions. E.g., val.csv or test.csv") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model checkpoints and predictions will be written.") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--max_seq_length", default=384, type=int, + help="The maximum total input sequence length after tokenization. Sequences " + "longer than this will be truncated, and sequences shorter than this will be padded.") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Rul evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Whether not to use CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--local_rank", type=int, default=-1, + help="local_rank for distributed training on gpus") + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + model.to(args.device) + + logger.info("Training/evaluation parameters %s", args) + + # Training + if args.do_train: + train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False) + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Save the trained model and the tokenizer + if args.local_rank == -1 or torch.distributed.get_rank() == 0: + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
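+        # (In pytorch_transformers, `save_pretrained()` writes the model weights
+        # to pytorch_model.bin and the configuration to config.json in output_dir.)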
+ # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir) + model.to(args.device) + + + # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + if args.do_train: + checkpoints = [args.output_dir] + else: + # if do_train is False and do_eval is true, load model directly from pretrained. + checkpoints = [args.model_name_or_path] + + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + for checkpoint in checkpoints: + # Reload the model + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + tokenizer = tokenizer_class.from_pretrained(checkpoint) + model.to(args.device) + + # Evaluate + result = evaluate(args, model, tokenizer, prefix=global_step) + + result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items()) + results.update(result) + + logger.info("Results: {}".format(results)) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/contrib/run_transfo_xl.py b/Optimus/code/examples/contrib/run_transfo_xl.py new file mode 100755 index 0000000000000000000000000000000000000000..4c99777b98235f62bbf060af066b9e7ccecfe36e --- /dev/null +++ b/Optimus/code/examples/contrib/run_transfo_xl.py @@ -0,0 +1,153 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch Transformer XL model evaluation script. + Adapted from https://github.com/kimiyoung/transformer-xl. 
+ In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/eval.py + + This script with default values evaluates a pretrained Transformer-XL on WikiText 103 +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import logging +import time +import math + +import torch + +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLCorpus, TransfoXLTokenizer + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +def main(): + parser = argparse.ArgumentParser(description='PyTorch Transformer Language Model') + parser.add_argument('--model_name', type=str, default='transfo-xl-wt103', + help='pretrained model name') + parser.add_argument('--split', type=str, default='test', + choices=['all', 'valid', 'test'], + help='which split to evaluate') + parser.add_argument('--batch_size', type=int, default=10, + help='batch size') + parser.add_argument('--tgt_len', type=int, default=128, + help='number of tokens to predict') + parser.add_argument('--ext_len', type=int, default=0, + help='length of the extended context') + parser.add_argument('--mem_len', type=int, default=1600, + help='length of the retained previous heads') + parser.add_argument('--clamp_len', type=int, default=1000, + help='max positional embedding index') + parser.add_argument('--no_cuda', action='store_true', + help='Do not use CUDA even though CUA is available') + parser.add_argument('--work_dir', type=str, required=True, + help='path to the work_dir') + parser.add_argument('--no_log', action='store_true', + help='do not log the eval result') + parser.add_argument('--same_length', action='store_true', + help='set same length attention with masking') + parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") + args = parser.parse_args() + assert args.ext_len >= 0, 'extended context length must be non-negative' + + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + logger.info("device: {}".format(device)) + + # Load a pre-processed dataset + # You can also build the corpus yourself using TransfoXLCorpus methods + # The pre-processing involve computing word frequencies to prepare the Adaptive input and SoftMax + # and tokenizing the dataset + # The pre-processed corpus is a convertion (using the conversion script ) + tokenizer = TransfoXLTokenizer.from_pretrained(args.model_name) + corpus = TransfoXLCorpus.from_pretrained(args.model_name) + ntokens = len(corpus.vocab) + + va_iter = corpus.get_iterator('valid', args.batch_size, args.tgt_len, + device=device, ext_len=args.ext_len) + te_iter = corpus.get_iterator('test', args.batch_size, args.tgt_len, + device=device, ext_len=args.ext_len) + + # Load a pre-trained model + model = TransfoXLLMHeadModel.from_pretrained(args.model_name) + model = model.to(device) + + logger.info('Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}'.format( + args.batch_size, args.tgt_len, 
args.ext_len, args.mem_len, args.clamp_len)) + + model.reset_length(args.tgt_len, args.ext_len, args.mem_len) + if args.clamp_len > 0: + model.clamp_len = args.clamp_len + if args.same_length: + model.same_length = True + + ############################################################################### + # Evaluation code + ############################################################################### + def evaluate(eval_iter): + # Turn on evaluation mode which disables dropout. + model.eval() + total_len, total_loss = 0, 0. + start_time = time.time() + with torch.no_grad(): + mems = None + for idx, (data, target, seq_len) in enumerate(eval_iter): + ret = model(data, lm_labels=target, mems=mems) + loss, _, mems = ret + loss = loss.mean() + total_loss += seq_len * loss.item() + total_len += seq_len + total_time = time.time() - start_time + logger.info('Time : {:.2f}s, {:.2f}ms/segment'.format( + total_time, 1000 * total_time / (idx+1))) + return total_loss / total_len + + # Run on test data. + if args.split == 'all': + test_loss = evaluate(te_iter) + valid_loss = evaluate(va_iter) + elif args.split == 'valid': + valid_loss = evaluate(va_iter) + test_loss = None + elif args.split == 'test': + test_loss = evaluate(te_iter) + valid_loss = None + + def format_log(loss, split): + log_str = '| {0} loss {1:5.2f} | {0} ppl {2:9.3f} '.format( + split, loss, math.exp(loss)) + return log_str + + log_str = '' + if valid_loss is not None: + log_str += format_log(valid_loss, 'valid') + if test_loss is not None: + log_str += format_log(test_loss, 'test') + + logger.info('=' * 100) + logger.info(log_str) + logger.info('=' * 100) + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/debug/test_azure_db.py b/Optimus/code/examples/debug/test_azure_db.py new file mode 100755 index 0000000000000000000000000000000000000000..7309a821ae04d541d54026ae5716edb3e66cd9c8 --- /dev/null +++ b/Optimus/code/examples/debug/test_azure_db.py @@ -0,0 +1,32 @@ +from azure.cosmosdb.table.tableservice import TableService +from azure.cosmosdb.table.models import Entity +from datetime import datetime + +import logging +logger = logging.getLogger(__name__) + +# Setup logging +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) + +logging.getLogger("azure").setLevel(logging.WARNING) +logging.getLogger("TableService").setLevel(logging.WARNING) + + +storage_name="textae" +key=r"6yBCXlblof8DVFJ4BD3eNFTrGQCej6cKfCf5z308cKnevyHaG+yl/m+ITVErB9yt0kvN3ToqxLIh0knJEfFmPA==" +ts = TableService(account_name=storage_name, account_key=key) + +# ts.create_table('firsttable') +table_name = 'firsttable' + +logger.info("Insert row into Table %s", table_name) + +row = { + 'PartitionKey': 'MILU_Rule_Rule_Template', + 'RowKey': str(datetime.now()), + 'iter': str(1) + } + +ts.insert_entity(table_name, row) \ No newline at end of file diff --git a/Optimus/code/examples/distillation/README.md b/Optimus/code/examples/distillation/README.md new file mode 100755 index 0000000000000000000000000000000000000000..12d9165536fa8bee4323823a1cb5ec00cbb785bb --- /dev/null +++ b/Optimus/code/examples/distillation/README.md @@ -0,0 +1,115 @@ +# DistilBERT + +This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT. + +**2019, September 19th - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. 
DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and an 86.9 F1 score on the SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
+
+## What is DistilBERT
+
+DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the Bert architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster, while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model (the teacher) into a smaller model (the student). By distilling Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale trained Transformer models into production.
+
+For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
+). *Please note that we will publish a formal write-up with updated and more complete results in the near future (September 19th).*
+
+Here are the updated results on the dev sets of GLUE:
+
+| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
+| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
+| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
+| DistilBERT | **75.2** | 49.1 | 81.8 | 90.2 | 87.0 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
+
+## Setup
+
+This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.
+
+**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0). It is important to note that there is a small internal bug in the current version of PyTorch available on pip that causes a memory leak in our training/distillation. It has been recently fixed and will likely be integrated into the next release. For the moment, we recommend [compiling PyTorch from source](https://github.com/pytorch/pytorch#from-source). Please refer to [issue 1179](https://github.com/huggingface/pytorch-transformers/issues/1179) for more details.
+
+## How to use DistilBERT
+
+PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):
+
+- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, a hidden dimension of 768 and 12 heads, totaling 66M parameters.
+- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, the Bert `bert-base-uncased` version reaches an 88.5 F1 score).
+
+Using DistilBERT is very similar to using BERT.
DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep naming consistent between the library models.
+
+```python
+tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+model = DistilBertModel.from_pretrained('distilbert-base-uncased')
+
+input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
+outputs = model(input_ids)
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+## How to train DistilBERT
+
+In the following, we will explain how you can train your own compressed model.
+
+### A. Preparing the data
+
+The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as the English version of BERT).
+
+To avoid processing the data several times, we do it once and for all before the training. From now on, we will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one or several coherent sentences).
+
+First, we will binarize the data, i.e. tokenize the data and convert each token to an index in our model's vocabulary.
+
+```bash
+python scripts/binarized_data.py \
+    --file_path data/dump.txt \
+    --bert_tokenizer bert-base-uncased \
+    --dump_file data/binarized_text
+```
+
+Our implementation of the masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s and smooths the masking probability with a factor that puts more emphasis on rare words. Thus we count the occurrences of each token in the data:
+
+```bash
+python scripts/token_counts.py \
+    --data_file data/binarized_text.bert-base-uncased.pickle \
+    --token_counts_dump data/token_counts.bert-base-uncased.pickle
+```
+
+### B. Training
+
+Training with distillation is straightforward once you have pre-processed the data:
+
+```bash
+python train.py \
+    --dump_path serialization_dir/my_first_training \
+    --data_file data/binarized_text.bert-base-uncased.pickle \
+    --token_counts data/token_counts.bert-base-uncased.pickle \
+    --force # overwrites the `dump_path` if it already exists.
+```
+
+By default, this will launch training on a single GPU (even if more are available on the cluster). Other parameters are available on the command line; please look in `train.py` or run `python train.py --help` to list them.
+
+We highly encourage you to use distributed training for training DistilBert, as the training corpus is quite large. Here's an example that runs distributed training on a single node with 4 GPUs:
+
+```bash
+export NODE_RANK=0
+export N_NODES=1
+
+export N_GPU_NODE=4
+export WORLD_SIZE=4
+export MASTER_PORT=
+export MASTER_ADDR=
+
+pkill -f 'python -u train.py'
+
+python -m torch.distributed.launch \
+    --nproc_per_node=$N_GPU_NODE \
+    --nnodes=$N_NODES \
+    --node_rank $NODE_RANK \
+    --master_addr $MASTER_ADDR \
+    --master_port $MASTER_PORT \
+    train.py \
+    --force \
+    --n_gpu $WORLD_SIZE \
+    --data_file data/binarized_text.bert-base-uncased.pickle \
+    --token_counts data/token_counts.bert-base-uncased.pickle \
+    --dump_path serialization_dir/my_first_distillation
+```
+
+**Tips:** Starting distillation training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself!
Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training! + +Happy distillation! diff --git a/Optimus/code/examples/distillation/dataset.py b/Optimus/code/examples/distillation/dataset.py new file mode 100755 index 0000000000000000000000000000000000000000..4babf73ea43282d75e59085ae6e60f41f69cbebf --- /dev/null +++ b/Optimus/code/examples/distillation/dataset.py @@ -0,0 +1,201 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Dataloaders to train DistilBERT + adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM) +""" +from typing import List +import math +from itertools import chain +from collections import Counter +import numpy as np +import torch + +from utils import logger + +class Dataset: + def __init__(self, + params, + data): + self.params = params + self.tokens_per_batch = params.tokens_per_batch + self.batch_size = params.batch_size + self.shuffle = params.shuffle + self.group_by_size = params.group_by_size + + self.token_ids = np.array(data) + self.lengths = np.uint16([len(t) for t in data]) + + self.check() + self.remove_long_sequences() + self.remove_empty_sequences() + self.check() + self.print_statistics() + + def __len__(self): + return len(self.lengths) + + def check(self): + """ + Some sanity checks + """ + assert len(self.token_ids) == len(self.lengths) + + def remove_long_sequences(self): + """ + Sequences that are too long are splitted by chunk of max_position_embeddings. + """ + indices = self.lengths >= self.params.max_position_embeddings + logger.info(f'Splitting {sum(indices)} too long sequences.') + + def divide_chunks(l, n): + return [l[i:i + n] for i in range(0, len(l), n)] + + new_tok_ids = [] + new_lengths = [] + cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token'] + max_len = self.params.max_position_embeddings + + for seq_, len_ in zip(self.token_ids, self.lengths): + if len_ <= max_len: + new_tok_ids.append(seq_) + new_lengths.append(len_) + else: + sub_seqs = [] + for sub_s in divide_chunks(seq_, max_len-2): + if sub_s[0] != cls_id: + sub_s = np.insert(sub_s, 0, cls_id) + if sub_s[-1] != sep_id: + sub_s = np.insert(sub_s, len(sub_s), sep_id) + assert len(sub_s) <= max_len + sub_seqs.append(sub_s) + + new_tok_ids.extend(sub_seqs) + new_lengths.extend([len(l) for l in sub_seqs]) + + self.token_ids = np.array(new_tok_ids) + self.lengths = np.array(new_lengths) + + def remove_empty_sequences(self): + """ + Too short sequences are simply removed. This could be tunedd. 
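        In the current implementation, sequences of 11 tokens or fewer are dropped (the `> 11` threshold below).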
+ """ + init_size = len(self) + indices = self.lengths > 11 + self.token_ids = self.token_ids[indices] + self.lengths = self.lengths[indices] + new_size = len(self) + logger.info(f'Remove {init_size - new_size} too short (<=11 tokens) sequences.') + + def print_statistics(self): + """ + Print some statistics on the corpus. Only the master process. + """ + if not self.params.is_master: + return + logger.info(f'{len(self)} sequences') + # data_len = sum(self.lengths) + # nb_unique_tokens = len(Counter(list(chain(*self.token_ids)))) + # logger.info(f'{data_len} tokens ({nb_unique_tokens} unique)') + + # unk_idx = self.params.special_tok_ids['unk_token'] + # nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids]) + # logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)') + + def select_data(self, a: int, b: int): + """ + Select a subportion of the data. + """ + n_sequences = len(self) + assert 0 <= a < b <= n_sequences, ValueError(f'`0 <= a < b <= n_sequences` is not met with a={a} and b={b}') + + logger.info(f'Selecting sequences from {a} to {b} (excluded).') + self.token_ids = self.token_ids[a:b] + self.lengths = self.lengths[a:b] + + self.check() + + def split(self): + """ + Distributed training: split the data accross the processes. + """ + assert self.params.n_gpu > 1 + logger.info('Splitting the data accross the processuses.') + n_seq = len(self) + n_seq_per_procesus = n_seq // self.params.world_size + a = n_seq_per_procesus * self.params.global_rank + b = a + n_seq_per_procesus + self.select_data(a=a, b=b) + + def batch_sequences(self, + token_ids: List[List[int]], + lengths: List[int]): + """ + Do the padding and transform into torch.tensor. + """ + assert len(token_ids) == len(lengths) + + # Max for paddings + max_seq_len_ = max(lengths) + + # Pad token ids + pad_idx = self.params.special_tok_ids['pad_token'] + tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids] + assert len(tk_) == len(token_ids) + assert all(len(t) == max_seq_len_ for t in tk_) + + tk_t = torch.tensor(tk_) # (bs, max_seq_len_) + lg_t = torch.tensor(lengths.astype(int)) # (bs) + return tk_t, lg_t + + def get_batches_iterator(self, + batches): + """ + Return an iterator over batches. + """ + for sequences_ids in batches: + token_ids, lengths = self.batch_sequences(self.token_ids[sequences_ids], + self.lengths[sequences_ids]) + yield (token_ids, lengths) + + def get_iterator(self, + seed: int = None): + """ + Return a data iterator. + """ + rng = np.random.RandomState(seed) + + n_sequences = len(self) + indices = np.arange(n_sequences) + + if self.group_by_size: + indices = indices[np.argsort(self.lengths[indices], kind='mergesort')] + + if self.tokens_per_batch == -1: + batches = np.array_split(indices, math.ceil(len(indices) * 1. 
/ self.batch_size)) + else: + assert self.tokens_per_batch > 0 + batch_ids = np.cumsum(self.lengths[indices]) // self.tokens_per_batch + _, bounds = np.unique(batch_ids, return_index=True) + batches = [indices[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)] + if bounds[-1] < len(indices): + batches.append(indices[bounds[-1]:]) + + if self.shuffle: + rng.shuffle(batches) + + assert n_sequences == sum([len(x) for x in batches]) + assert self.lengths[indices].sum() == sum([self.lengths[x].sum() for x in batches]) + + return self.get_batches_iterator(batches=batches) diff --git a/Optimus/code/examples/distillation/distiller.py b/Optimus/code/examples/distillation/distiller.py new file mode 100755 index 0000000000000000000000000000000000000000..c22ee3b397877f977b677f8468cdeef4bfa89a98 --- /dev/null +++ b/Optimus/code/examples/distillation/distiller.py @@ -0,0 +1,490 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" The distiller to distil DistilBERT + adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM) +""" +import os +import math +import psutil +import time +from tensorboardX import SummaryWriter +from tqdm import trange, tqdm +import numpy as np +import psutil + +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.optim import AdamW + +from pytorch_transformers import WarmupLinearSchedule + +from utils import logger +from dataset import Dataset + +class Distiller: + def __init__(self, + params: dict, + dataloader: Dataset, + token_probs: torch.tensor, + student: nn.Module, + teacher: nn.Module): + logger.info('Initializing Distiller') + self.params = params + self.dump_path = params.dump_path + self.multi_gpu = params.multi_gpu + self.fp16 = params.fp16 + + self.student = student + self.teacher = teacher + + self.dataloader = dataloader + if self.params.n_gpu > 1: + self.dataloader.split() + self.get_iterator(seed=params.seed) + + self.temperature = params.temperature + assert self.temperature > 0. + + self.alpha_ce = params.alpha_ce + self.alpha_mlm = params.alpha_mlm + self.alpha_mse = params.alpha_mse + self.alpha_cos = params.alpha_cos + assert self.alpha_ce >= 0. + assert self.alpha_mlm >= 0. + assert self.alpha_mse >= 0. + assert self.alpha_cos >= 0. + assert self.alpha_ce + self.alpha_mlm + self.alpha_mse + self.alpha_cos > 0. 
+ + self.mlm_mask_prop = params.mlm_mask_prop + assert 0.0 <= self.mlm_mask_prop <= 1.0 + assert params.word_mask + params.word_keep + params.word_rand == 1.0 + self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand]) + self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs + self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs + if self.fp16: + self.pred_probs = self.pred_probs.half() + self.token_probs = self.token_probs.half() + + self.epoch = 0 + self.n_iter = 0 + self.n_total_iter = 0 + self.n_sequences_epoch = 0 + self.total_loss_epoch = 0 + self.last_loss = 0 + self.last_loss_ce = 0 + self.last_loss_mlm = 0 + if self.alpha_mse > 0.: self.last_loss_mse = 0 + if self.alpha_cos > 0.: self.last_loss_cos = 0 + self.last_log = 0 + + self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean') + self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1) + if self.alpha_mse > 0.: + self.mse_loss_fct = nn.MSELoss(reduction='sum') + if self.alpha_cos > 0.: + self.cosine_loss_fct = nn.CosineEmbeddingLoss(reduction='mean') + + logger.info('--- Initializing model optimizer') + assert params.gradient_accumulation_steps >= 1 + self.num_steps_epoch = int(len(self.dataloader) / params.batch_size) + 1 + num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1 + + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in student.named_parameters() if not any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': params.weight_decay}, + {'params': [p for n, p in student.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': 0.0} + ] + logger.info("------ Number of trainable parameters (student): %i" % sum([p.numel() for p in self.student.parameters() if p.requires_grad])) + logger.info("------ Number of parameters (student): %i" % sum([p.numel() for p in self.student.parameters()])) + self.optimizer = AdamW(optimizer_grouped_parameters, + lr=params.learning_rate, + eps=params.adam_epsilon, + betas=(0.9, 0.98)) + + warmup_steps = math.ceil(num_train_optimization_steps * params.warmup_prop) + self.scheduler = WarmupLinearSchedule(self.optimizer, + warmup_steps=warmup_steps, + t_total=num_train_optimization_steps) + + if self.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + logger.info(f"Using fp16 training: {self.params.fp16_opt_level} level") + self.student, self.optimizer = amp.initialize(self.student, + self.optimizer, + opt_level=self.params.fp16_opt_level) + self.teacher = self.teacher.half() + + if self.multi_gpu: + if self.fp16: + from apex.parallel import DistributedDataParallel + logger.info("Using apex.parallel.DistributedDataParallel for distributed training.") + self.student = DistributedDataParallel(self.student) + else: + from torch.nn.parallel import DistributedDataParallel + logger.info("Using nn.parallel.DistributedDataParallel for distributed training.") + self.student = DistributedDataParallel(self.student, + device_ids=[params.local_rank], + output_device=params.local_rank) + + self.is_master = params.is_master + if self.is_master: + logger.info('--- Initializing Tensorboard') + self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train')) + 
self.tensorboard.add_text(tag='config', text_string=str(self.params), global_step=0) + + def get_iterator(self, + seed: int = None): + """ + Initialize the data iterator. + Each process has its own data iterator (iterating on his own random portion of the dataset). + + Input: + ------ + seed: `int` - The random seed. + """ + logger.info('--- Initializing Data Iterator') + self.data_iterator = self.dataloader.get_iterator(seed=seed) + + def get_batch(self): + """ + Call the data iterator to output a new batch. + If the data iterator went through the whole dataset, create a new iterator. + """ + assert hasattr(self, 'data_iterator') + try: + x = next(self.data_iterator) + except StopIteration: + logger.warning('--- Went through the whole dataset. Creating new data iterator.') + self.data_iterator = self.dataloader.get_iterator() + x = next(self.data_iterator) + return x + + def prepare_batch(self, + batch): + """ + Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM. + + Input: + ------ + batch: `Tuple` + token_ids: `torch.tensor(bs, seq_length)` - The token ids for each of the sequence. It is padded. + lengths: `torch.tensor(bs)` - The lengths of each of the sequences in the batch. + + Output: + ------- + token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM. + attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention. + mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -1 where there is nothing to predict. + """ + token_ids, lengths = batch + token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths) + assert token_ids.size(0) == lengths.size(0) + + attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None]) + + bs, max_seq_len = token_ids.size() + mlm_labels = token_ids.new(token_ids.size()).copy_(token_ids) + + x_prob = self.token_probs[token_ids.flatten()] + n_tgt = math.ceil(self.mlm_mask_prop * lengths.sum().item()) + tgt_ids = torch.multinomial(x_prob / x_prob.sum(), n_tgt, replacement=False) + pred_mask = torch.zeros(bs * max_seq_len, dtype=torch.bool, device=token_ids.device) # previously `dtype=torch.uint8`, cf pytorch 1.2.0 compatibility + pred_mask[tgt_ids] = 1 + pred_mask = pred_mask.view(bs, max_seq_len) + + pred_mask[token_ids == self.params.special_tok_ids['pad_token']] = 0 + + # mask a number of words == 0 [8] (faster with fp16) + if self.fp16: + n1 = pred_mask.sum().item() + if n1 > 8: + pred_mask = pred_mask.view(-1) + n2 = max(n1 % 8, 8 * (n1 // 8)) + if n2 != n1: + pred_mask[torch.nonzero(pred_mask).view(-1)[:n1-n2]] = 0 + pred_mask = pred_mask.view(bs, max_seq_len) + assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item() + + _token_ids_real = token_ids[pred_mask] + _token_ids_rand = _token_ids_real.clone().random_(self.params.vocab_size) + _token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token']) + probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True) + _token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long() + token_ids = token_ids.masked_scatter(pred_mask, _token_ids) + + mlm_labels[~pred_mask] = -1 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility + + return token_ids, attn_mask, mlm_labels + + def round_batch(self, + x: torch.tensor, + lengths: torch.tensor): + """ + For float16 
only. + Sub-sample sentences in a batch, and add padding, so that each dimension is a multiple of 8. + + Input: + ------ + x: `torch.tensor(bs, seq_length)` - The token ids. + lengths: `torch.tensor(bs, seq_length)` - The lengths of each of the sequence in the batch. + + Output: + ------- + x: `torch.tensor(new_bs, new_seq_length)` - The updated token ids. + lengths: `torch.tensor(new_bs, new_seq_length)` - The updated lengths. + """ + if not self.fp16 or len(lengths) < 8: + return x, lengths + + # number of sentences == 0 [8] + bs1 = len(lengths) + bs2 = 8 * (bs1 // 8) + assert bs2 > 0 and bs2 % 8 == 0 + if bs1 != bs2: + idx = torch.randperm(bs1)[:bs2] + lengths = lengths[idx] + slen = lengths.max().item() + x = x[idx, :slen] + else: + idx = None + + # sequence length == 0 [8] + ml1 = x.size(1) + if ml1 % 8 != 0: + pad = 8 - (ml1 % 8) + ml2 = ml1 + pad + pad_id = self.params.special_tok_ids['pad_token'] + padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id) + x = torch.cat([x, padding_tensor], 1) + assert x.size() == (bs2, ml2) + + assert x.size(0) % 8 == 0 + assert x.size(1) % 8 == 0 + return x, lengths + + def train(self): + """ + The real training loop. + """ + if self.is_master: logger.info('Starting training') + self.last_log = time.time() + self.student.train() + self.teacher.eval() + + for _ in range(self.params.n_epoch): + if self.is_master: logger.info(f'--- Starting epoch {self.epoch}/{self.params.n_epoch-1}') + if self.multi_gpu: + torch.distributed.barrier() + + iter_bar = trange(self.num_steps_epoch, desc="-Iter", disable=self.params.local_rank not in [-1, 0]) + for __ in range(self.num_steps_epoch): + batch = self.get_batch() + if self.params.n_gpu > 0: + batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch) + token_ids, attn_mask, mlm_labels = self.prepare_batch(batch=batch) + + self.step(input_ids=token_ids, attention_mask=attn_mask, mlm_labels=mlm_labels) + + iter_bar.update() + iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}', + 'Avg_cum_loss': f'{self.total_loss_epoch/self.n_iter:.2f}'}) + iter_bar.close() + + if self.is_master: logger.info(f'--- Ending epoch {self.epoch}/{self.params.n_epoch-1}') + self.end_epoch() + + if self.is_master: + logger.info(f'Save very last checkpoint as `pytorch_model.bin`.') + self.save_checkpoint(checkpoint_name=f'pytorch_model.bin') + logger.info('Training is finished') + + def step(self, + input_ids: torch.tensor, + attention_mask: torch.tensor, + mlm_labels: torch.tensor): + """ + One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation), + and possibly a parameter update (depending on the gradient accumulation). + + Input: + ------ + input_ids: `torch.tensor(bs, seq_length)` - The token ids. + attention_mask: `torch.tensor(bs, seq_length)` - The attention mask for self attention. + mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels. 
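        The total loss is `alpha_ce * L_ce + alpha_mlm * L_mlm + alpha_mse * L_mse + alpha_cos * L_cos`,
        where `L_ce` is the KL divergence between the temperature-softened student and teacher distributions
        (scaled by temperature**2); the optional terms are only computed when their weight is > 0.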
+ """ + s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size) + with torch.no_grad(): + t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size) + assert s_logits.size() == t_logits.size() + + #https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100 + #https://github.com/peterliht/knowledge-distillation-pytorch/issues/2 + if self.params.restrict_ce_to_mask: + mask = (mlm_labels>-1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size) + else: + mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size) + s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask + s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask + t_logits_slct = torch.masked_select(t_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask + t_logits_slct = t_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask + assert t_logits_slct.size() == s_logits_slct.size() + + loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1), + F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2 + loss = self.alpha_ce*loss_ce + if self.alpha_mlm > 0.: + loss_mlm = self.mlm_loss_fct(s_logits.view(-1, s_logits.size(-1)), mlm_labels.view(-1)) + loss += self.alpha_mlm * loss_mlm + if self.alpha_mse > 0.: + loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction + loss += self.alpha_mse * loss_mse + + if self.alpha_cos > 0.: + s_hidden_states = s_hidden_states[-1] # (bs, seq_length, dim) + t_hidden_states = t_hidden_states[-1] # (bs, seq_length, dim) + mask = attention_mask.unsqueeze(-1).expand_as(s_hidden_states) # (bs, seq_length, dim) + assert s_hidden_states.size() == t_hidden_states.size() + dim = s_hidden_states.size(-1) + + s_hidden_states_slct = torch.masked_select(s_hidden_states, mask) # (bs * seq_length * dim) + s_hidden_states_slct = s_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim) + t_hidden_states_slct = torch.masked_select(t_hidden_states, mask) # (bs * seq_length * dim) + t_hidden_states_slct = t_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim) + + target = s_hidden_states_slct.new(s_hidden_states_slct.size(0)).fill_(1) # (bs * seq_length,) + loss_cos = self.cosine_loss_fct(s_hidden_states_slct, t_hidden_states_slct, target) + loss += self.alpha_cos * loss_cos + + self.total_loss_epoch += loss.item() + self.last_loss = loss.item() + self.last_loss_ce = loss_ce.item() + if self.alpha_mlm > 0.: + self.last_loss_mlm = loss_mlm.item() + if self.alpha_mse > 0.: + self.last_loss_mse = loss_mse.item() + if self.alpha_cos > 0.: + self.last_loss_cos = loss_cos.item() + + self.optimize(loss) + + self.n_sequences_epoch += input_ids.size(0) + + def optimize(self, + loss): + """ + Normalization on the loss (gradient accumulation or distributed training), followed by + backward pass on the loss, possibly followed by a parameter update (depending on the gradient accumulation). + Also update the metrics for tensorboard. 
+ """ + # Check for NaN + if (loss != loss).data.any(): + logger.error('NaN detected') + exit() + + if self.multi_gpu: + loss = loss.mean() + if self.params.gradient_accumulation_steps > 1: + loss = loss / self.params.gradient_accumulation_steps + + if self.fp16: + from apex import amp + with amp.scale_loss(loss, self.optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + self.iter() + if self.n_iter % self.params.gradient_accumulation_steps == 0: + if self.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), self.params.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(self.student.parameters(), self.params.max_grad_norm) + self.optimizer.step() + self.optimizer.zero_grad() + self.scheduler.step() + + def iter(self): + """ + Update global counts, write to tensorboard and save checkpoint. + """ + self.n_iter += 1 + self.n_total_iter += 1 + + if self.n_total_iter % self.params.log_interval == 0: + self.log_tensorboard() + self.last_log = time.time() + if self.n_total_iter % self.params.checkpoint_interval == 0: + self.save_checkpoint() + + def log_tensorboard(self): + """ + Log into tensorboard. Only by the master process. + """ + if not self.is_master: + return + + for param_name, param in self.student.named_parameters(): + self.tensorboard.add_scalar(tag='parameter_mean/' + param_name, scalar_value=param.data.mean(), global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag='parameter_std/' + param_name, scalar_value=param.data.std(), global_step=self.n_total_iter) + if param.grad is None: + continue + self.tensorboard.add_scalar(tag="grad_mean/" + param_name, scalar_value=param.grad.data.mean(),global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag="grad_std/" + param_name, scalar_value=param.grad.data.std(), global_step=self.n_total_iter) + + self.tensorboard.add_scalar(tag="losses/cum_avg_loss_epoch", scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag="losses/loss", scalar_value=self.last_loss, global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter) + if self.alpha_mlm > 0.: + self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter) + if self.alpha_mse > 0.: + self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter) + if self.alpha_cos > 0.: + self.tensorboard.add_scalar(tag="losses/loss_cos", scalar_value=self.last_loss_cos, global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag="learning_rate/lr", scalar_value=self.scheduler.get_lr()[0], global_step=self.n_total_iter) + + self.tensorboard.add_scalar(tag="global/memory_usage", scalar_value=psutil.virtual_memory()._asdict()['used']/1_000_000, global_step=self.n_total_iter) + self.tensorboard.add_scalar(tag="global/speed", scalar_value=time.time()-self.last_log, global_step=self.n_total_iter) + + def end_epoch(self): + """ + Finally arrived at the end of epoch (full pass on dataset). + Do some tensorboard logging and checkpoint saving. 
+ """ + logger.info(f'{self.n_sequences_epoch} sequences have been trained during this epoch.') + + if self.is_master: + self.save_checkpoint(checkpoint_name=f'model_epoch_{self.epoch}.pth') + self.tensorboard.add_scalar(tag='epoch/loss', scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.epoch) + + self.epoch += 1 + self.n_sequences_epoch = 0 + self.n_iter = 0 + self.total_loss_epoch = 0 + + def save_checkpoint(self, + checkpoint_name: str = 'checkpoint.pth'): + """ + Save the current state. Only by the master process. + """ + if not self.is_master: + return + mdl_to_save = self.student.module if hasattr(self.student, 'module') else self.student + mdl_to_save.config.save_pretrained(self.dump_path) + state_dict = mdl_to_save.state_dict() + torch.save(state_dict, os.path.join(self.dump_path, checkpoint_name)) diff --git a/Optimus/code/examples/distillation/requirements.txt b/Optimus/code/examples/distillation/requirements.txt new file mode 100755 index 0000000000000000000000000000000000000000..2cf6ee2d8197c45e1721a911d720ecac516e5d49 --- /dev/null +++ b/Optimus/code/examples/distillation/requirements.txt @@ -0,0 +1,6 @@ +gitpython==3.0.2 +tensorboard>=1.14.0 +tensorboardX==1.8 +psutil==5.6.3 +scipy==1.3.1 +pytorch_transformers==1.2.0 diff --git a/Optimus/code/examples/distillation/scripts/binarized_data.py b/Optimus/code/examples/distillation/scripts/binarized_data.py new file mode 100755 index 0000000000000000000000000000000000000000..de9e39fff3bde7ac92940fb6a2db02085d8045d4 --- /dev/null +++ b/Optimus/code/examples/distillation/scripts/binarized_data.py @@ -0,0 +1,86 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Preprocessing script before training DistilBERT. 
+""" +import argparse +import pickle +import random +import time +import numpy as np +from pytorch_transformers import BertTokenizer, RobertaTokenizer +import logging + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +def main(): + parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).") + parser.add_argument('--file_path', type=str, default='data/dump.txt', + help='The path to the data.') + parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta']) + parser.add_argument('--tokenizer_name', type=str, default='bert-base-uncased', + help="The tokenizer to use.") + parser.add_argument('--dump_file', type=str, default='data/dump', + help='The dump file prefix.') + args = parser.parse_args() + + + logger.info(f'Loading Tokenizer ({args.tokenizer_name})') + if args.tokenizer_type == 'bert': + tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name) + elif args.tokenizer_type == 'roberta': + tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name) + bos = tokenizer.special_tokens_map['bos_token'] # `[CLS]` for bert, `` for roberta + sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]` for bert, `` for roberta + + logger.info(f'Loading text from {args.file_path}') + with open(args.file_path, 'r', encoding='utf8') as fp: + data = fp.readlines() + + + logger.info(f'Start encoding') + logger.info(f'{len(data)} examples to process.') + + rslt = [] + iter = 0 + interval = 10000 + start = time.time() + for text in data: + text = f'{bos} {text.strip()} {sep}' + token_ids = tokenizer.encode(text) + rslt.append(token_ids) + + iter += 1 + if iter % interval == 0: + end = time.time() + logger.info(f'{iter} examples processed. - {(end-start)/interval:.2f}s/expl') + start = time.time() + logger.info('Finished binarization') + logger.info(f'{len(data)} examples processed.') + + + dp_file = f'{args.dump_file}.{args.tokenizer_name}.pickle' + rslt_ = [np.uint16(d) for d in rslt] + random.shuffle(rslt_) + logger.info(f'Dump to {dp_file}') + with open(dp_file, 'wb') as handle: + pickle.dump(rslt_, handle, protocol=pickle.HIGHEST_PROTOCOL) + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/distillation/scripts/extract_for_distil.py b/Optimus/code/examples/distillation/scripts/extract_for_distil.py new file mode 100755 index 0000000000000000000000000000000000000000..43554d1c9f2e24634ec35f3ab34d2f03818ba2c9 --- /dev/null +++ b/Optimus/code/examples/distillation/scripts/extract_for_distil.py @@ -0,0 +1,90 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Preprocessing script before training DistilBERT. 
+""" +from pytorch_transformers import BertForMaskedLM, RobertaForMaskedLM +import torch +import argparse + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description="Extraction some layers of the full BertForMaskedLM or RObertaForMaskedLM for Transfer Learned Distillation") + parser.add_argument("--model_type", default="bert", choices=["bert", "roberta"]) + parser.add_argument("--model_name", default='bert-base-uncased', type=str) + parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_bert-base-uncased_0247911.pth', type=str) + parser.add_argument("--vocab_transform", action='store_true') + args = parser.parse_args() + + + if args.model_type == 'bert': + model = BertForMaskedLM.from_pretrained(args.model_name) + prefix = 'bert' + elif args.model_type == 'roberta': + model = RobertaForMaskedLM.from_pretrained(args.model_name) + prefix = 'roberta' + + state_dict = model.state_dict() + compressed_sd = {} + + for w in ['word_embeddings', 'position_embeddings']: + compressed_sd[f'distilbert.embeddings.{w}.weight'] = \ + state_dict[f'{prefix}.embeddings.{w}.weight'] + for w in ['weight', 'bias']: + compressed_sd[f'distilbert.embeddings.LayerNorm.{w}'] = \ + state_dict[f'{prefix}.embeddings.LayerNorm.{w}'] + + std_idx = 0 + for teacher_idx in [0, 2, 4, 7, 9, 11]: + for w in ['weight', 'bias']: + compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.query.{w}'] + compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.key.{w}'] + compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.value.{w}'] + + compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.output.dense.{w}'] + compressed_sd[f'distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}'] + + compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.intermediate.dense.{w}'] + compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.dense.{w}'] + compressed_sd[f'distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \ + state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.LayerNorm.{w}'] + std_idx += 1 + + if args.model_type == 'bert': + compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight'] + compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias'] + if args.vocab_transform: + for w in ['weight', 'bias']: + compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}'] + compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}'] + elif args.model_type == 'roberta': + compressed_sd[f'vocab_projector.weight'] = state_dict[f'lm_head.decoder.weight'] + compressed_sd[f'vocab_projector.bias'] = state_dict[f'lm_head.bias'] + if args.vocab_transform: + for w in ['weight', 'bias']: + compressed_sd[f'vocab_transform.{w}'] = state_dict[f'lm_head.dense.{w}'] + compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}'] + + print(f'N layers selected for distillation: 
{std_idx}') + print(f'Number of params transfered for distillation: {len(compressed_sd.keys())}') + + print(f'Save transfered checkpoint to {args.dump_checkpoint}.') + torch.save(compressed_sd, args.dump_checkpoint) diff --git a/Optimus/code/examples/distillation/scripts/token_counts.py b/Optimus/code/examples/distillation/scripts/token_counts.py new file mode 100755 index 0000000000000000000000000000000000000000..a484a6f51b3f659a7294e592b407bc9a8bd6e5b1 --- /dev/null +++ b/Optimus/code/examples/distillation/scripts/token_counts.py @@ -0,0 +1,51 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Preprocessing script before training DistilBERT. +""" +from collections import Counter +import argparse +import pickle +import logging + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)") + parser.add_argument("--data_file", type=str, default="data/dump.bert-base-uncased.pickle", + help="The binarized dataset.") + parser.add_argument("--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle", + help="The dump file.") + parser.add_argument("--vocab_size", default=30522, type=int) + args = parser.parse_args() + + logger.info(f'Loading data from {args.data_file}') + with open(args.data_file, 'rb') as fp: + data = pickle.load(fp) + + logger.info('Counting occurences for MLM.') + counter = Counter() + for tk_ids in data: + counter.update(tk_ids) + counts = [0]*args.vocab_size + for k, v in counter.items(): + counts[k] = v + + logger.info(f'Dump to {args.token_counts_dump}') + with open(args.token_counts_dump, 'wb') as handle: + pickle.dump(counts, handle, protocol=pickle.HIGHEST_PROTOCOL) diff --git a/Optimus/code/examples/distillation/train.py b/Optimus/code/examples/distillation/train.py new file mode 100755 index 0000000000000000000000000000000000000000..5cbb7e2dcde4533eb278dc551e1fc68784acb331 --- /dev/null +++ b/Optimus/code/examples/distillation/train.py @@ -0,0 +1,247 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Training DistilBERT. 
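Parses the distillation hyper-parameters, loads the binarized data and the token counts, builds the student
and teacher models, and runs the `Distiller`.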
+""" +import os +import argparse +import pickle +import json +import shutil +import numpy as np +import torch + +from pytorch_transformers import BertTokenizer, BertForMaskedLM, RobertaTokenizer, RobertaForMaskedLM +from pytorch_transformers import DistilBertForMaskedLM, DistilBertConfig + +from distiller import Distiller +from utils import git_log, logger, init_gpu_params, set_seed +from dataset import Dataset + + +def main(): + parser = argparse.ArgumentParser(description="Training") + + parser.add_argument("--dump_path", type=str, required=True, + help="The output directory (log, checkpoints, parameters, etc.)") + parser.add_argument("--data_file", type=str, required=True, + help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.") + parser.add_argument("--token_counts", type=str, required=True, + help="The token counts in the data_file for MLM.") + parser.add_argument("--force", action='store_true', + help="Overwrite dump_path if it already exists.") + + parser.add_argument("--vocab_size", default=30522, type=int, + help="The vocabulary size.") + parser.add_argument("--max_position_embeddings", default=512, type=int, + help="Maximum sequence length we can model (including [CLS] and [SEP]).") + parser.add_argument("--sinusoidal_pos_embds", action='store_false', + help="If true, the position embeddings are simply fixed with sinusoidal embeddings.") + parser.add_argument("--n_layers", default=6, type=int, + help="Number of Transformer blocks.") + parser.add_argument("--n_heads", default=12, type=int, + help="Number of heads in the self-attention module.") + parser.add_argument("--dim", default=768, type=int, + help="Dimension through the network. Must be divisible by n_heads") + parser.add_argument("--hidden_dim", default=3072, type=int, + help="Intermediate dimension in the FFN.") + parser.add_argument("--dropout", default=0.1, type=float, + help="Dropout.") + parser.add_argument("--attention_dropout", default=0.1, type=float, + help="Dropout in self-attention.") + parser.add_argument("--activation", default='gelu', type=str, + help="Activation to use in self-attention") + parser.add_argument("--tie_weights_", action='store_false', + help="If true, we tie the embeddings matrix with the projection over the vocabulary matrix. Default is true.") + + parser.add_argument("--from_pretrained_weights", default=None, type=str, + help="Load student initialization checkpoint.") + parser.add_argument("--from_pretrained_config", default=None, type=str, + help="Load student initialization architecture config.") + parser.add_argument("--teacher_type", default="bert", choices=["bert", "roberta"], + help="Teacher type (BERT, RoBERTa).") + parser.add_argument("--teacher_name", default="bert-base-uncased", type=str, + help="The teacher model.") + + parser.add_argument("--temperature", default=2., type=float, + help="Temperature for the softmax temperature.") + parser.add_argument("--alpha_ce", default=0.5, type=float, + help="Linear weight for the distillation loss. Must be >=0.") + parser.add_argument("--alpha_mlm", default=0.5, type=float, + help="Linear weight for the MLM loss. Must be >=0.") + parser.add_argument("--alpha_mse", default=0.0, type=float, + help="Linear weight of the MSE loss. Must be >=0.") + parser.add_argument("--alpha_cos", default=0.0, type=float, + help="Linear weight of the cosine embedding loss. 
Must be >=0.") + parser.add_argument("--mlm_mask_prop", default=0.15, type=float, + help="Proportion of tokens for which we need to make a prediction.") + parser.add_argument("--word_mask", default=0.8, type=float, + help="Proportion of tokens to mask out.") + parser.add_argument("--word_keep", default=0.1, type=float, + help="Proportion of tokens to keep.") + parser.add_argument("--word_rand", default=0.1, type=float, + help="Proportion of tokens to randomly replace.") + parser.add_argument("--mlm_smoothing", default=0.7, type=float, + help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).") + parser.add_argument("--restrict_ce_to_mask", action='store_true', + help="If true, compute the distilation loss only the [MLM] prediction distribution.") + + parser.add_argument("--n_epoch", type=int, default=3, + help="Number of pass on the whole dataset.") + parser.add_argument("--batch_size", type=int, default=5, + help="Batch size (for each process).") + parser.add_argument("--tokens_per_batch", type=int, default=-1, + help="If specified, modify the batches so that they have approximately this number of tokens.") + parser.add_argument("--shuffle", action='store_false', + help="If true, shuffle the sequence order. Default is true.") + parser.add_argument("--group_by_size", action='store_false', + help="If true, group sequences that have similar length into the same batch. Default is true.") + + parser.add_argument("--gradient_accumulation_steps", type=int, default=50, + help="Gradient accumulation for larger training batches.") + parser.add_argument("--warmup_prop", default=0.05, type=float, + help="Linear warmup proportion.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--learning_rate", default=5e-4, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=5.0, type=float, + help="Max gradient norm.") + parser.add_argument("--initializer_range", default=0.02, type=float, + help="Random initialization range.") + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." 
+ "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--n_gpu", type=int, default=1, + help="Number of GPUs in the node.") + parser.add_argument("--local_rank", type=int, default=-1, + help="Distributed training - Local rank") + parser.add_argument("--seed", type=int, default=56, + help="Random seed") + + parser.add_argument("--log_interval", type=int, default=500, + help="Tensorboard logging interval.") + parser.add_argument("--checkpoint_interval", type=int, default=4000, + help="Checkpoint interval.") + args = parser.parse_args() + + + ## ARGS ## + init_gpu_params(args) + set_seed(args) + if args.is_master: + if os.path.exists(args.dump_path): + if not args.force: + raise ValueError(f'Serialization dir {args.dump_path} already exists, but you have not precised wheter to overwrite it' + 'Use `--force` if you want to overwrite it') + else: + shutil.rmtree(args.dump_path) + + if not os.path.exists(args.dump_path): + os.makedirs(args.dump_path) + logger.info(f'Experiment will be dumped and logged in {args.dump_path}') + + + ### SAVE PARAMS ### + logger.info(f'Param: {args}') + with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f: + json.dump(vars(args), f, indent=4) + git_log(args.dump_path) + assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \ + (args.from_pretrained_weights is not None and args.from_pretrained_config is not None) + + + ### TOKENIZER ### + if args.teacher_type == 'bert': + tokenizer = BertTokenizer.from_pretrained(args.teacher_name) + elif args.teacher_type == 'roberta': + tokenizer = RobertaTokenizer.from_pretrained(args.teacher_name) + special_tok_ids = {} + for tok_name, tok_symbol in tokenizer.special_tokens_map.items(): + idx = tokenizer.all_special_tokens.index(tok_symbol) + special_tok_ids[tok_name] = tokenizer.all_special_ids[idx] + logger.info(f'Special tokens {special_tok_ids}') + args.special_tok_ids = special_tok_ids + + + ## DATA LOADER ## + logger.info(f'Loading data from {args.data_file}') + with open(args.data_file, 'rb') as fp: + data = pickle.load(fp) + + + assert os.path.isfile(args.token_counts) + logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)') + with open(args.token_counts, 'rb') as fp: + counts = pickle.load(fp) + assert len(counts) == args.vocab_size + token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing + for idx in special_tok_ids.values(): + token_probs[idx] = 0. 
# do not predict special tokens + token_probs = torch.from_numpy(token_probs) + + + train_dataloader = Dataset(params=args, data=data) + logger.info(f'Data loader created.') + + + ## STUDENT ## + if args.from_pretrained_weights is not None: + assert os.path.isfile(args.from_pretrained_weights) + assert os.path.isfile(args.from_pretrained_config) + logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}') + logger.info(f'Loading pretrained config from {args.from_pretrained_config}') + stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config) + stu_architecture_config.output_hidden_states = True + student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights, + config=stu_architecture_config) + else: + args.vocab_size_or_config_json_file = args.vocab_size + stu_architecture_config = DistilBertConfig(**vars(args), output_hidden_states=True) + student = DistilBertForMaskedLM(stu_architecture_config) + + + if args.n_gpu > 0: + student.to(f'cuda:{args.local_rank}') + logger.info(f'Student loaded.') + + + ## TEACHER ## + if args.teacher_type == 'bert': + teacher = BertForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True) + elif args.teacher_type == 'roberta': + teacher = RobertaForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True) + if args.n_gpu > 0: + teacher.to(f'cuda:{args.local_rank}') + logger.info(f'Teacher loaded from {args.teacher_name}.') + + ## DISTILLER ## + torch.cuda.empty_cache() + distiller = Distiller(params=args, + dataloader=train_dataloader, + token_probs=token_probs, + student=student, + teacher=teacher) + distiller.train() + logger.info("Let's go get some drinks.") + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/distillation/utils.py b/Optimus/code/examples/distillation/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..3d625047108879aee354259b7da57b10045396d3 --- /dev/null +++ b/Optimus/code/examples/distillation/utils.py @@ -0,0 +1,129 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Utils to train DistilBERT + adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM) +""" +import git +import json +import os +import socket +import torch +import numpy as np + +import logging +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - PID: %(process)d - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + + +def git_log(folder_path: str): + """ + Log commit info. + """ + repo = git.Repo(search_parent_directories=True) + repo_infos = { + 'repo_id': str(repo), + 'repo_sha': str(repo.head.object.hexsha), + 'repo_branch': str(repo.active_branch) + } + + with open(os.path.join(folder_path, 'git_log.json'), 'w') as f: + json.dump(repo_infos, f, indent=4) + + +def init_gpu_params(params): + """ + Handle single and multi-GPU / multi-node. 
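    In the multi-GPU case, `WORLD_SIZE`, `N_GPU_NODE`, `RANK`, `N_NODES` and `NODE_RANK` are read from the
    environment (the variables exported in the distributed training example of the README).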
+ """ + if params.n_gpu <= 0: + params.local_rank = 0 + params.master_port = -1 + params.is_master = True + params.multi_gpu = False + return + + assert torch.cuda.is_available() + + logger.info('Initializing GPUs') + if params.n_gpu > 1: + assert params.local_rank != -1 + + params.world_size = int(os.environ['WORLD_SIZE']) + params.n_gpu_per_node = int(os.environ['N_GPU_NODE']) + params.global_rank = int(os.environ['RANK']) + + # number of nodes / node ID + params.n_nodes = params.world_size // params.n_gpu_per_node + params.node_id = params.global_rank // params.n_gpu_per_node + params.multi_gpu = True + + assert params.n_nodes == int(os.environ['N_NODES']) + assert params.node_id == int(os.environ['NODE_RANK']) + + # local job (single GPU) + else: + assert params.local_rank == -1 + + params.n_nodes = 1 + params.node_id = 0 + params.local_rank = 0 + params.global_rank = 0 + params.world_size = 1 + params.n_gpu_per_node = 1 + params.multi_gpu = False + + # sanity checks + assert params.n_nodes >= 1 + assert 0 <= params.node_id < params.n_nodes + assert 0 <= params.local_rank <= params.global_rank < params.world_size + assert params.world_size == params.n_nodes * params.n_gpu_per_node + + # define whether this is the master process / if we are in multi-node distributed mode + params.is_master = params.node_id == 0 and params.local_rank == 0 + params.multi_node = params.n_nodes > 1 + + # summary + PREFIX = f"--- Global rank: {params.global_rank} - " + logger.info(PREFIX + "Number of nodes: %i" % params.n_nodes) + logger.info(PREFIX + "Node ID : %i" % params.node_id) + logger.info(PREFIX + "Local rank : %i" % params.local_rank) + logger.info(PREFIX + "World size : %i" % params.world_size) + logger.info(PREFIX + "GPUs per node : %i" % params.n_gpu_per_node) + logger.info(PREFIX + "Master : %s" % str(params.is_master)) + logger.info(PREFIX + "Multi-node : %s" % str(params.multi_node)) + logger.info(PREFIX + "Multi-GPU : %s" % str(params.multi_gpu)) + logger.info(PREFIX + "Hostname : %s" % socket.gethostname()) + + # set GPU device + torch.cuda.set_device(params.local_rank) + + # initialize multi-GPU + if params.multi_gpu: + logger.info("Initializing PyTorch distributed") + torch.distributed.init_process_group( + init_method='env://', + backend='nccl', + ) + + +def set_seed(args): + """ + Set the random seed. + """ + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) diff --git a/Optimus/code/examples/requirements.txt b/Optimus/code/examples/requirements.txt new file mode 100755 index 0000000000000000000000000000000000000000..42abe8933c6d9d7440484f6c8db2063b2fb5442e --- /dev/null +++ b/Optimus/code/examples/requirements.txt @@ -0,0 +1,2 @@ +tensorboardX +scikit-learn \ No newline at end of file diff --git a/Optimus/code/examples/run_bertology.py b/Optimus/code/examples/run_bertology.py new file mode 100755 index 0000000000000000000000000000000000000000..f11b73b54f033b1c5223407772566c5567698be1 --- /dev/null +++ b/Optimus/code/examples/run_bertology.py @@ -0,0 +1,348 @@ +#!/usr/bin/env python3 +# Copyright 2018 CMU and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Bertology: this script shows how you can explore the internals of the models in the library to: + - compute the entropy of the head attentions + - compute the importance of each head + - prune (remove) the low importance head. + Some parts of this script are adapted from the code of Michel et al. (http://arxiv.org/abs/1905.10650) + which is available at https://github.com/pmichel31415/are-16-heads-really-better-than-1 +""" +import os +import argparse +import logging +from datetime import timedelta, datetime +from tqdm import tqdm + +import numpy as np + +import torch +from torch.utils.data import DataLoader, SequentialSampler, TensorDataset, Subset +from torch.utils.data.distributed import DistributedSampler +from torch.nn import CrossEntropyLoss, MSELoss + +from pytorch_transformers import (WEIGHTS_NAME, + BertConfig, BertForSequenceClassification, BertTokenizer, + XLMConfig, XLMForSequenceClassification, XLMTokenizer, + XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer) + +from run_glue import set_seed, load_and_cache_examples, ALL_MODELS, MODEL_CLASSES + +from utils_glue import (compute_metrics, convert_examples_to_features, + output_modes, processors) + +logger = logging.getLogger(__name__) + + +def entropy(p): + """ Compute the entropy of a probability distribution """ + plogp = p * torch.log(p) + plogp[p == 0] = 0 + return -plogp.sum(dim=-1) + + +def print_2d_tensor(tensor): + """ Print a 2D tensor """ + logger.info("lv, h >\t" + "\t".join(f"{x + 1}" for x in range(len(tensor)))) + for row in range(len(tensor)): + if tensor.dtype != torch.long: + logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:.5f}" for x in tensor[row].cpu().data)) + else: + logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:d}" for x in tensor[row].cpu().data)) + + +def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None): + """ This method shows how to compute: + - head attention entropy + - head importance scores according to http://arxiv.org/abs/1905.10650 + """ + # Prepare our tensors + n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads + head_importance = torch.zeros(n_layers, n_heads).to(args.device) + attn_entropy = torch.zeros(n_layers, n_heads).to(args.device) + + if head_mask is None: + head_mask = torch.ones(n_layers, n_heads).to(args.device) + head_mask.requires_grad_(requires_grad=True) + preds = None + labels = None + tot_tokens = 0.0 + + for step, batch in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])): + batch = tuple(t.to(args.device) for t in batch) + input_ids, input_mask, segment_ids, label_ids = batch + + # Do a forward pass (not with torch.no_grad() since we need gradients for importance score - see below) + outputs = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, labels=label_ids, head_mask=head_mask) + loss, logits, all_attentions = outputs[0], outputs[1], outputs[-1] # Loss and logits are the first, attention the last + loss.backward() # Backpropagate to populate the gradients in the head 
mask + + if compute_entropy: + for layer, attn in enumerate(all_attentions): + masked_entropy = entropy(attn.detach()) * input_mask.float().unsqueeze(1) + attn_entropy[layer] += masked_entropy.sum(-1).sum(0).detach() + + if compute_importance: + head_importance += head_mask.grad.abs().detach() + + # Also store our logits/labels if we want to compute metrics afterwards + if preds is None: + preds = logits.detach().cpu().numpy() + labels = label_ids.detach().cpu().numpy() + else: + preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) + labels = np.append(labels, label_ids.detach().cpu().numpy(), axis=0) + + tot_tokens += input_mask.float().detach().sum().data + + # Normalize + attn_entropy /= tot_tokens + head_importance /= tot_tokens + # Layerwise importance normalization + if not args.dont_normalize_importance_by_layer: + exponent = 2 + norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1/exponent) + head_importance /= norm_by_layer.unsqueeze(-1) + 1e-20 + + if not args.dont_normalize_global_importance: + head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min()) + + # Print/save matrices + np.save(os.path.join(args.output_dir, 'attn_entropy.npy'), attn_entropy.detach().cpu().numpy()) + np.save(os.path.join(args.output_dir, 'head_importance.npy'), head_importance.detach().cpu().numpy()) + + logger.info("Attention entropies") + print_2d_tensor(attn_entropy) + logger.info("Head importance scores") + print_2d_tensor(head_importance) + logger.info("Head ranked by importance scores") + head_ranks = torch.zeros(head_importance.numel(), dtype=torch.long, device=args.device) + head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(head_importance.numel(), device=args.device) + head_ranks = head_ranks.view_as(head_importance) + print_2d_tensor(head_ranks) + + return attn_entropy, head_importance, preds, labels + + +def mask_heads(args, model, eval_dataloader): + """ This method shows how to mask head (set some heads to zero), to test the effect on the network, + based on the head importance scores, as described in Michel et al. 
(http://arxiv.org/abs/1905.10650) + """ + _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False) + preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds) + original_score = compute_metrics(args.task_name, preds, labels)[args.metric_name] + logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold) + + new_head_mask = torch.ones_like(head_importance) + num_to_mask = max(1, int(new_head_mask.numel() * args.masking_amount)) + + current_score = original_score + while current_score >= original_score * args.masking_threshold: + head_mask = new_head_mask.clone() # save current head mask + # heads from least important to most - keep only not-masked heads + head_importance[head_mask == 0.0] = float('Inf') + current_heads_to_mask = head_importance.view(-1).sort()[1] + + if len(current_heads_to_mask) <= num_to_mask: + break + + # mask heads + current_heads_to_mask = current_heads_to_mask[:num_to_mask] + logger.info("Heads to mask: %s", str(current_heads_to_mask.tolist())) + new_head_mask = new_head_mask.view(-1) + new_head_mask[current_heads_to_mask] = 0.0 + new_head_mask = new_head_mask.view_as(head_mask) + print_2d_tensor(new_head_mask) + + # Compute metric and head importance again + _, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask) + preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds) + current_score = compute_metrics(args.task_name, preds, labels)[args.metric_name] + logger.info("Masking: current score: %f, remaning heads %d (%.1f percents)", current_score, new_head_mask.sum(), new_head_mask.sum()/new_head_mask.numel() * 100) + + logger.info("Final head mask") + print_2d_tensor(head_mask) + np.save(os.path.join(args.output_dir, 'head_mask.npy'), head_mask.detach().cpu().numpy()) + + return head_mask + + +def prune_heads(args, model, eval_dataloader, head_mask): + """ This method shows how to prune head (remove heads weights) based on + the head importance scores as described in Michel et al. 
(http://arxiv.org/abs/1905.10650) + """ + # Try pruning and test time speedup + # Pruning is like masking but we actually remove the masked weights + before_time = datetime.now() + _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader, + compute_entropy=False, compute_importance=False, head_mask=head_mask) + preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds) + score_masking = compute_metrics(args.task_name, preds, labels)[args.metric_name] + original_time = datetime.now() - before_time + + original_num_params = sum(p.numel() for p in model.parameters()) + heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask))) + assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item() + model.prune_heads(heads_to_prune) + pruned_num_params = sum(p.numel() for p in model.parameters()) + + before_time = datetime.now() + _, _, preds, labels = compute_heads_importance(args, model, eval_dataloader, + compute_entropy=False, compute_importance=False, head_mask=None) + preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds) + score_pruning = compute_metrics(args.task_name, preds, labels)[args.metric_name] + new_time = datetime.now() - before_time + + logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)", original_num_params, pruned_num_params, pruned_num_params/original_num_params * 100) + logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning) + logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time/new_time * 100) + + +def main(): + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--data_dir", default=None, type=str, required=True, + help="The input data dir. 
Should contain the .tsv files (or other data files) for the task.") + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join( + ALL_MODELS)) + parser.add_argument("--task_name", default=None, type=str, required=True, + help="The name of the task to train selected in the list: " + ", ".join(processors.keys())) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name_or_path") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + parser.add_argument("--data_subset", type=int, default=-1, + help="If > 0: limit the data to a subset of data_subset instances.") + parser.add_argument("--overwrite_output_dir", action='store_true', + help="Whether to overwrite data in output directory") + + parser.add_argument("--dont_normalize_importance_by_layer", action='store_true', + help="Don't normalize importance score by layers") + parser.add_argument("--dont_normalize_global_importance", action='store_true', + help="Don't normalize all importance scores between 0 and 1") + + parser.add_argument("--try_masking", action='store_true', + help="Whether to try to mask head until a threshold of accuracy.") + parser.add_argument("--masking_threshold", default=0.9, type=float, + help="masking threshold in term of metrics (stop masking when metric < threshold * original metric value).") + parser.add_argument("--masking_amount", default=0.1, type=float, + help="Amount to heads to masking at each masking step.") + parser.add_argument("--metric_name", default="acc", type=str, + help="Metric to use for head masking.") + + parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after WordPiece tokenization. 
\n" + "Sequences longer than this will be truncated, sequences shorter padded.") + parser.add_argument("--batch_size", default=1, type=int, help="Batch size.") + + parser.add_argument("--seed", type=int, default=42) + parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus") + parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available") + parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") + args = parser.parse_args() + + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup devices and distributed training + if args.local_rank == -1 or args.no_cuda: + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: + torch.cuda.set_device(args.local_rank) + args.device = torch.device("cuda", args.local_rank) + args.n_gpu = 1 + torch.distributed.init_process_group(backend='nccl') # Initializes the distributed backend + + # Setup logging + logging.basicConfig(level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, args.n_gpu, bool(args.local_rank != -1))) + + # Set seeds + set_seed(args) + + # Prepare GLUE task + args.task_name = args.task_name.lower() + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + args.output_mode = output_modes[args.task_name] + label_list = processor.get_labels() + num_labels = len(label_list) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + args.model_type = "" + for key in MODEL_CLASSES: + if key in args.model_name_or_path.lower(): + args.model_type = key # take the first match in model types + break + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, + num_labels=num_labels, finetuning_task=args.task_name, + output_attentions=True) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + # Distributed and parallel training + model.to(args.device) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + elif args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Print/save training arguments + torch.save(args, os.path.join(args.output_dir, 'run_args.bin')) + logger.info("Training/evaluation parameters %s", args) + + # Prepare dataset for the GLUE task + 
eval_data = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=True) + if args.data_subset > 0: + eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data))))) + eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data) + eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size) + + + # Compute head entropy and importance score + compute_heads_importance(args, model, eval_dataloader) + + + # Try head masking (set heads to zero until the score goes under a threshole) + # and head pruning (remove masked heads and see the effect on the network) + if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0: + head_mask = mask_heads(args, model, eval_dataloader) + prune_heads(args, model, eval_dataloader, head_mask) + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/run_generation.py b/Optimus/code/examples/run_generation.py new file mode 100755 index 0000000000000000000000000000000000000000..a2a8f29103172b2772dbe4539e4ada7c9785ab0f --- /dev/null +++ b/Optimus/code/examples/run_generation.py @@ -0,0 +1,195 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
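The head scores accumulated by `compute_heads_importance` above are the absolute gradients of the task loss with respect to a per-head mask, and the entropies are taken over each head's attention distribution. A minimal, self-contained sketch of both quantities on toy tensors (shapes and random values are assumptions for illustration, not taken from the script):

```python
import torch

def entropy(p):
    # entropy over the last dimension, with 0 * log(0) treated as 0
    plogp = p * torch.log(p)
    plogp[p == 0] = 0
    return -plogp.sum(dim=-1)

n_layers, n_heads, seq_len = 2, 4, 5

# toy attention probabilities: (layers, heads, query positions, key positions)
attn = torch.softmax(torch.randn(n_layers, n_heads, seq_len, seq_len), dim=-1)
attn_entropy = entropy(attn).mean(dim=-1)        # averaged over positions -> (layers, heads)

# gradient-based importance: |d(loss)/d(head_mask)| per (layer, head)
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
loss = (head_mask * torch.randn(n_layers, n_heads)).sum()   # stand-in for the classification loss
loss.backward()
head_importance = head_mask.grad.abs()

print(attn_entropy.shape, head_importance.shape)  # torch.Size([2, 4]) twice
```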
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import logging +from tqdm import trange + +import torch +import torch.nn.functional as F +import numpy as np + +from pytorch_transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig + +from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer +from pytorch_transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer +from pytorch_transformers import XLNetLMHeadModel, XLNetTokenizer +from pytorch_transformers import TransfoXLLMHeadModel, TransfoXLTokenizer + + +logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO) +logger = logging.getLogger(__name__) + +MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ()) + +MODEL_CLASSES = { + 'gpt2': (GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'xlnet': (XLNetLMHeadModel, XLNetTokenizer), + 'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer), +} + +# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia +# in https://github.com/rusiaaman/XLNet-gen#methodology +# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e +PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family +(except for Alexei and Maria) are discovered. +The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the +remainder of the story. 1883 Western Siberia, +a young Grigori Rasputin is asked by his father and a group of men to perform magic. +Rasputin has a vision and denounces one of the men as a horse thief. Although his +father initially slaps him for making such an accusation, Rasputin watches as the +man is chased outside and beaten. Twenty years later, Rasputin sees a vision of +the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, +with people, even a bishop, begging for his blessing. """ + + +def set_seed(args): + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + top_k = min(top_k, logits.size(-1)) # Safety check + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] + logits[indices_to_remove] = filter_value + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + logits[indices_to_remove] = filter_value + return logits + + +def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'): + context = torch.tensor(context, dtype=torch.long, device=device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context + with torch.no_grad(): + for _ in trange(length): + + inputs = {'input_ids': generated} + if is_xlnet: + # XLNet is a direct (predict same token, not next token) and bi-directional model by default + # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring) + input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1) + perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device) + target_mapping[0, 0, -1] = 1.0 # predict last token + inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping} + + outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + next_token_logits = outputs[0][0, -1, :] / temperature + filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p) + next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1) + generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1) + return generated + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--prompt", type=str, default="") + parser.add_argument("--padding_text", type=str, default="") + parser.add_argument("--length", type=int, default=20) + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_k", type=int, default=0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--seed', type=int, default=42, + 
help="random seed for initialization") + args = parser.parse_args() + + args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + + set_seed(args) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + model = model_class.from_pretrained(args.model_name_or_path) + model.to(args.device) + model.eval() + + if args.length < 0 and model.config.max_position_embeddings > 0: + args.length = model.config.max_position_embeddings + elif 0 < model.config.max_position_embeddings < args.length: + args.length = model.config.max_position_embeddings # No generation bigger than model size + elif args.length < 0: + args.length = MAX_LENGTH # avoid infinite loop + + print(args) + while True: + raw_text = args.prompt if args.prompt else input("Model prompt >>> ") + if args.model_type in ["transfo-xl", "xlnet"]: + # Models with memory likes to have a long prompt for short inputs. + raw_text = (args.padding_text if args.padding_text else PADDING_TEXT) + raw_text + context_tokens = tokenizer.encode(raw_text) + out = sample_sequence( + model=model, + context=context_tokens, + length=args.length, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + device=args.device, + is_xlnet=bool(args.model_type == "xlnet"), + ) + out = out[0, len(context_tokens):].tolist() + text = tokenizer.decode(out, clean_up_tokenization_spaces=True) + print(text) + if args.prompt: + break + return text + + +if __name__ == '__main__': + main() diff --git a/Optimus/code/examples/run_glue.py b/Optimus/code/examples/run_glue.py new file mode 100755 index 0000000000000000000000000000000000000000..45a103065127a85ca436b477bd84aa0c7b231f4d --- /dev/null +++ b/Optimus/code/examples/run_glue.py @@ -0,0 +1,547 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
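The generation loop above narrows the next-token distribution with `top_k_top_p_filtering` before sampling. A minimal sketch of the nucleus (top-p) step on a toy five-token vocabulary (the logit values and `top_p` setting are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # toy next-token logits
top_p = 0.9

sorted_logits, sorted_idx = torch.sort(logits, descending=True)
cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

# drop tokens once the cumulative probability exceeds top_p,
# but always keep the first token that crosses the threshold
remove = cum_probs > top_p
remove[1:] = remove[:-1].clone()
remove[0] = False
logits[sorted_idx[remove]] = float('-inf')

next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
print(next_token.item())   # index of the sampled token
```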
+""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa).""" + +from __future__ import absolute_import, division, print_function + +import argparse +import glob +import logging +import os +import random +import pdb + +cwd = os.getcwd() +print(f"Current working dir is {cwd}") + +import sys +sys.path.append('./') +pt_path = os.path.join( cwd, 'pytorch_transformers') +sys.path.append(pt_path) +print(f"Pytorch Transformer {pt_path}") + + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForSequenceClassification, BertTokenizer, + RobertaConfig, + RobertaForSequenceClassification, + RobertaTokenizer, + XLMConfig, XLMForSequenceClassification, + XLMTokenizer, XLNetConfig, + XLNetForSequenceClassification, + XLNetTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +from utils_glue import (compute_metrics, convert_examples_to_features, + output_modes, processors) + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForSequenceClassification, BertTokenizer), + 'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer), + 'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer), + 'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # 
Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproductibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + batch = tuple(t.to(args.device) for t in batch) + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM and RoBERTa don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + tmp_output, pooled_fea = outputs + loss = tmp_output[0] + + # outputs = model(**inputs) + # loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + scheduler.step() # Update learning rate schedule + optimizer.step() + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model 
checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,) + eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,) + + results = {} + for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs): + eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + + latent_features = [] + latent_labels = [] + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + model.eval() + batch = tuple(t.to(args.device) for t in batch) + + with torch.no_grad(): + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM and RoBERTa don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + tmp_output, pooled_fea = outputs + tmp_eval_loss, logits = tmp_output + + eval_loss += tmp_eval_loss.mean().item() + nb_eval_steps += 1 + if preds is None: + preds = logits.detach().cpu().numpy() + out_label_ids = inputs['labels'].detach().cpu().numpy() + else: + preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) + out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) + + if args.collect_feature: + latent_features.append(pooled_fea) + + if args.collect_feature: + latent_features = torch.cat(latent_features, dim=0) + latent_labels = out_label_ids + return latent_features, latent_labels + + + eval_loss = eval_loss / nb_eval_steps + if args.output_mode == "classification": + preds = np.argmax(preds, axis=1) + elif args.output_mode == "regression": + preds = np.squeeze(preds) + result = compute_metrics(eval_task, preds, out_label_ids) + results.update(result) + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return results + + +def load_and_cache_examples(args, task, tokenizer, evaluate=False): + if args.local_rank not in [-1, 0] and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + processor = processors[task]() + output_mode = 
output_modes[task] + # Load data features from cache or dataset file + cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}_{}'.format( + 'dev' if evaluate else 'train', + list(filter(None, args.model_name_or_path.split('/'))).pop(), + str(args.max_seq_length), + str(args.percentage_per_label), + str(task))) + + if False: # os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + features = torch.load(cached_features_file) + else: + logger.info("Creating features from dataset file at %s", args.data_dir) + label_list = processor.get_labels() + if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']: + # HACK(label indices are swapped in RoBERTa pretrained model) + label_list[1], label_list[2] = label_list[2], label_list[1] + examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir, args.percentage_per_label, args.sample_per_label) + features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer, output_mode, + cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end + cls_token=tokenizer.cls_token, + cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, + sep_token=tokenizer.sep_token, + sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 + pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet + pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], + pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, + ) + if args.local_rank in [-1, 0]: + logger.info("Saving features into cached file %s", cached_features_file) + torch.save(features, cached_features_file) + + if args.local_rank == 0 and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Convert to Tensors and build dataset + all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + if output_mode == "classification": + all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) + elif output_mode == "regression": + all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.float) + + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) + return dataset + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--data_dir", default=None, type=str, required=True, + help="The input data dir. 
Should contain the .tsv files (or other data files) for the task.") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--task_name", default=None, type=str, required=True, + help="The name of the task to train selected in the list: " + ", ".join(processors.keys())) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Rul evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument("--percentage_per_label", type=float, default=1.0, + help="Set this value (<1.0), if you are using a subset of training dataset.") + parser.add_argument("--sample_per_label", type=int, default=-1, + help="Set this value, if you are using a subset of training dataset, and a fixed number of samples are specified.") + parser.add_argument("--use_freeze", action='store_true', + help="Set this flag if you are not updating the model.") + + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=99, + help="random seed for initialization") + parser.add_argument('--collect_feature', action='store_true', + help="Collect feature on training or evaluation sets") + + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Prepare GLUE task + args.task_name = args.task_name.lower() + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + args.output_mode = output_modes[args.task_name] + label_list = processor.get_labels() + num_labels = len(label_list) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + + model.use_freeze = args.use_freeze + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + model.to(args.device) + + logger.info("Training/evaluation parameters %s", args) + + + # Training + if args.do_train: + train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False) + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + model.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + + if args.collect_feature: + global_step = 0 + latent_features, latent_labels = evaluate(args, model, tokenizer, prefix=global_step) + cached_features_file= os.path.join(args.output_dir, 'latent_features') + torch.save([latent_features,latent_labels], cached_features_file) + return + + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + for checkpoint in checkpoints[-1]: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/run_glue_data_integration.py b/Optimus/code/examples/run_glue_data_integration.py new file mode 100755 index 0000000000000000000000000000000000000000..a96e4292bdcd20a4dd8b7361b24ac851a52d2165 --- /dev/null +++ b/Optimus/code/examples/run_glue_data_integration.py @@ -0,0 +1,231 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
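The `train()` loop in `run_glue.py` excludes biases and LayerNorm weights from weight decay by splitting the parameters into two optimizer groups. A minimal sketch of that grouping on a toy module (the toy model and hyper-parameter values are assumptions, and `torch.optim.AdamW` stands in for the library's `AdamW`):

```python
import torch
import torch.nn as nn

class ToyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(8, 8)
        self.LayerNorm = nn.LayerNorm(8)

model = ToyHead()
no_decay = ['bias', 'LayerNorm.weight']
grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]
optimizer = torch.optim.AdamW(grouped_parameters, lr=5e-5, eps=1e-8)
print([len(g['params']) for g in optimizer.param_groups])   # [1, 3]
```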
+""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa).""" + +from __future__ import absolute_import, division, print_function + +import argparse +import glob +import logging +import os +import random +import pdb + +cwd = os.getcwd() +print(f"Current working dir is {cwd}") + +import sys +sys.path.append('./') +pt_path = os.path.join( cwd, 'pytorch_transformers') +sys.path.append(pt_path) +print(f"Pytorch Transformer {pt_path}") + + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForSequenceClassification, BertTokenizer,BertForSequenceClassificationLatentConnector, + RobertaConfig, + RobertaForSequenceClassification, + RobertaTokenizer, + XLMConfig, XLMForSequenceClassification, + XLMTokenizer, XLNetConfig, + XLNetForSequenceClassification, + XLNetTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +from utils_glue import (compute_metrics, convert_examples_to_features, + output_modes, processors) + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForSequenceClassificationLatentConnector, BertTokenizer), + 'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer), + 'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer), + 'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + +def load_and_cache_examples(args, task, tokenizer, file_txt, evaluate=False): + if args.local_rank not in [-1, 0] and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + processor = processors[task]() + output_mode = output_modes[task] + + label_list = processor.get_labels() + if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']: + # HACK(label indices are swapped in RoBERTa pretrained model) + label_list[1], label_list[2] = label_list[2], label_list[1] + examples = processor.get_train_examples(args.data_dir, args.percentage_per_label, args.sample_per_label) + + # Chunyuan: convert examples into text lines here + + # write data in a file. + for item in examples: + # pdb.set_trace() + if item.text_b: + line = item.text_a + " " + tokenizer.sep_token + " " + item.text_b + "\n" + else: + line = item.text_a + " \n" + file_txt.write(line) + + file_txt.close() + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--data_dir", default=None, type=str, required=True, + help="The input data dir. 
Should contain the .tsv files (or other data files) for the task.") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.") + + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument("--percentage_per_label", type=float, default=1.0, + help="Set this value (<1.0), if you are using a subset of training dataset.") + parser.add_argument("--sample_per_label", type=int, default=-1, + help="Set this value, if you are using a subset of training dataset, and a fixed number of samples are specified.") + parser.add_argument("--use_freeze", action='store_true', + help="Set this flag if you are not updating the model.") + + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + + # Set seed + set_seed(args) + + + ## Tokenizer + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + if not os.path.isdir(args.output_dir): + os.mkdir(args.output_dir) + + logger.info("Parameters %s", args) + + # Prepare GLUE task + TASK_NAME = ['CoLA', 'SST-2', 'MRPC', 'STS-B', 'QQP', 'MNLI', 'QNLI', 'RTE', 'WNLI'] + parent_path = args.data_dir + + for task_ in TASK_NAME: + + args.data_dir = os.path.join(parent_path, task_) + args.task_name = task_.lower() + + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + args.output_mode = output_modes[args.task_name] + + args.output_file_name = os.path.join(args.output_dir, f"{args.task_name}.txt") + logger.info("Dataset input file at %s", args.data_dir) + logger.info("Dataset ouput file at %s", args.output_file_name) + + file_txt = open(args.output_file_name, "w") + + load_and_cache_examples(args, args.task_name, tokenizer, file_txt, evaluate=False) + + + + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/run_glue_vae.py b/Optimus/code/examples/run_glue_vae.py new file mode 100755 index 0000000000000000000000000000000000000000..6db8c104ed1cfcc00e3c0c141e35519ec9a2b900 --- /dev/null +++ b/Optimus/code/examples/run_glue_vae.py @@ -0,0 +1,565 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
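`run_glue_data_integration.py` flattens each GLUE example into a single text line, joining sentence pairs with the tokenizer's separator token. A small sketch of the resulting format (the stub class, example sentences, `[SEP]` token, and output filename are illustrative assumptions):

```python
class InputExampleStub:
    """Stand-in for the GLUE InputExample returned by the processors."""
    def __init__(self, text_a, text_b=None):
        self.text_a, self.text_b = text_a, text_b

sep_token = '[SEP]'   # BERT-style separator; the script reads it from the tokenizer

examples = [
    InputExampleStub("A man is playing a guitar.", "Someone is making music."),
    InputExampleStub("The movie was surprisingly good."),
]

with open("rte.txt", "w") as f:          # hypothetical per-task output file
    for item in examples:
        if item.text_b:
            f.write(item.text_a + " " + sep_token + " " + item.text_b + "\n")
        else:
            f.write(item.text_a + " \n")
```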
+""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa).""" + +from __future__ import absolute_import, division, print_function + +import argparse +import glob +import logging +import os +import random + +cwd = os.getcwd() +print(f"Current working dir is {cwd}") + +import sys +sys.path.append('./') +pt_path = os.path.join( cwd, 'pytorch_transformers') +sys.path.append(pt_path) +print(f"Pytorch Transformer {pt_path}") + + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForSequenceClassification, BertTokenizer,BertForSequenceClassificationLatentConnector, + RobertaConfig, + RobertaForSequenceClassification, + RobertaTokenizer, + XLMConfig, XLMForSequenceClassification, + XLMTokenizer, XLNetConfig, + XLNetForSequenceClassification, + XLNetTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +from utils_glue import (compute_metrics, convert_examples_to_features, + output_modes, processors) + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForSequenceClassificationLatentConnector, BertTokenizer), + 'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer), + 'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer), + 'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + 
model = torch.nn.DataParallel(model) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproductibility (even between python 2 and 3) + for epoch in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + batch = tuple(t.to(args.device) for t in batch) + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM and RoBERTa don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + tmp_output, pooled_fea = outputs + loss = tmp_output[0] + + # + # loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.use_philly: + print("PROGRESS: {}%".format(round(100 * (step + epoch*len(epoch_iterator) ) /(int(args.num_train_epochs) * len(epoch_iterator)) , 4))) + print("EVALERR: {}%".format(loss)) + + + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + scheduler.step() # Update learning rate schedule + optimizer.step() + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if 
hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,) + eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,) + + results = {} + for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs): + eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + + latent_features = [] + latent_labels = [] + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + model.eval() + batch = tuple(t.to(args.device) for t in batch) + + with torch.no_grad(): + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM and RoBERTa don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + tmp_output, pooled_fea = outputs + tmp_eval_loss, logits = tmp_output + + eval_loss += tmp_eval_loss.mean().item() + nb_eval_steps += 1 + if preds is None: + preds = logits.detach().cpu().numpy() + out_label_ids = inputs['labels'].detach().cpu().numpy() + else: + preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) + out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) + + if args.collect_feature: + latent_features.append(pooled_fea) + + if args.collect_feature: + latent_features = torch.cat(latent_features, dim=0) + latent_labels = out_label_ids + return latent_features, latent_labels + + + + eval_loss = eval_loss / nb_eval_steps + if args.output_mode == "classification": + preds = np.argmax(preds, axis=1) + elif args.output_mode == "regression": + preds = np.squeeze(preds) + result = compute_metrics(eval_task, preds, out_label_ids) + results.update(result) + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return results + + +def load_and_cache_examples(args, task, tokenizer, evaluate=False): + if args.local_rank not in [-1, 0] and 
not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + processor = processors[task]() + output_mode = output_modes[task] + # Load data features from cache or dataset file + cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}_{}'.format( + 'dev' if evaluate else 'train', + list(filter(None, args.model_name_or_path.split('/'))).pop(), + str(args.max_seq_length), + str(args.percentage_per_label), + str(task))) + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + features = torch.load(cached_features_file) + else: + logger.info("Creating features from dataset file at %s", args.data_dir) + label_list = processor.get_labels() + if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']: + # HACK(label indices are swapped in RoBERTa pretrained model) + label_list[1], label_list[2] = label_list[2], label_list[1] + examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir, args.percentage_per_label, args.sample_per_label) + features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer, output_mode, + cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end + cls_token=tokenizer.cls_token, + cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, + sep_token=tokenizer.sep_token, + sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 + pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet + pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], + pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, + ) + if args.local_rank in [-1, 0]: + logger.info("Saving features into cached file %s", cached_features_file) + torch.save(features, cached_features_file) + + if args.local_rank == 0 and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Convert to Tensors and build dataset + all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + if output_mode == "classification": + all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) + elif output_mode == "regression": + all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.float) + + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) + return dataset + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--checkpoint_dir", default=None, type=str, required=True, + help="The directory where checkpoints are saved.") + parser.add_argument("--data_dir", default=None, type=str, required=True, + help="The input data dir. 
Should contain the .tsv files (or other data files) for the task.") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--task_name", default=None, type=str, required=True, + help="The name of the task to train selected in the list: " + ", ".join(processors.keys())) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + + parser.add_argument("--latent_size", default=32, type=int, help="Latent space dimension.") + parser.add_argument('--gloabl_step_eval', type=int, default=661, + help="Evaluate the results at the given global step") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Rul evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + parser.add_argument("--percentage_per_label", type=float, default=1.0, + help="Set this value (<1.0), if you are using a subset of training dataset.") + parser.add_argument("--sample_per_label", type=int, default=-1, + help="Set this value, if you are using a subset of training dataset, and a fixed number of samples are specified.") + parser.add_argument("--use_freeze", action='store_true', + help="Set this flag if you are not updating the model.") + + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set 
total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + parser.add_argument("--use_philly", action='store_true', + help="Use Philly for computing.") + parser.add_argument('--collect_feature', action='store_true', + help="Collect feature on training or evaluation sets") + + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Prepare GLUE task + args.task_name = args.task_name.lower() + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + args.output_mode = output_modes[args.task_name] + label_list = processor.get_labels() + num_labels = len(label_list) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + + global_step = args.gloabl_step_eval + encoder_dir = os.path.join(args.checkpoint_dir, 'checkpoint-encoder-{}'.format(global_step)) + + ## Encoder + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + model = model_class.from_pretrained(encoder_dir, config=config, latent_size = args.latent_size) + + model.use_freeze = args.use_freeze + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + model.to(args.device) + + logger.info("Training/evaluation parameters %s", args) + + + # Training + if args.do_train: + train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False) + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, 
configuration and tokenizer using `save_pretrained()`. + # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir, latent_size = args.latent_size) + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + model.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + + if args.collect_feature: + global_step = 0 + latent_features, latent_labels = evaluate(args, model, tokenizer, prefix=global_step) + cached_features_file= os.path.join(args.output_dir, 'latent_features_vae') + torch.save([latent_features,latent_labels], cached_features_file) + return + + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint, latent_size = args.latent_size) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/run_lm_finetuning.py b/Optimus/code/examples/run_lm_finetuning.py new file mode 100755 index 0000000000000000000000000000000000000000..e3aa2d00d3f27eb6c01c691200987fcd18adcfcc --- /dev/null +++ b/Optimus/code/examples/run_lm_finetuning.py @@ -0,0 +1,507 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). +GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned +using a masked language modeling (MLM) loss. 
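+
+Illustrative invocation (a sketch only; the data paths in angle brackets are placeholders):
+
+    python examples/run_lm_finetuning.py \
+        --model_type gpt2 --model_name_or_path gpt2 \
+        --do_train --train_data_file <path/to/train.txt> \
+        --do_eval --eval_data_file <path/to/valid.txt> \
+        --output_dir <output_dir>
+
+For BERT or RoBERTa, add --mlm; the script raises an error otherwise, since those models
+are trained with a masked-LM loss rather than a causal-LM loss.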
+""" + +from __future__ import absolute_import, division, print_function + +import pdb + +import sys +sys.path.insert(0, '.') + +import argparse +import glob +import logging +import os +import pickle +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule, + BertConfig, BertForMaskedLM, BertTokenizer, + GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, + OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) + + +import pdb + +logger = logging.getLogger(__name__) + + +MODEL_CLASSES = { + 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer), + 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer), + 'bert': (BertConfig, BertForMaskedLM, BertTokenizer), + 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer) +} + + +class TextDataset(Dataset): + def __init__(self, tokenizer, file_path='train', block_size=512): + assert os.path.isfile(file_path) + directory, filename = os.path.split(file_path) + cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}') + + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + with open(cached_features_file, 'rb') as handle: + self.examples = pickle.load(handle) + else: + logger.info("Creating features from dataset file at %s", directory) + + self.examples = [] + with open(file_path, encoding="utf-8") as f: + text = f.read() + + tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)) + + while len(tokenized_text) >= block_size: # Truncate in block of block_size + self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size])) + tokenized_text = tokenized_text[block_size:] + # Note that we are loosing the last truncated example here for the sake of simplicity (no padding) + # If your dataset is small, first you should loook for a bigger one :-) and second you + # can change this behavior by adding (model specific) padding. + + logger.info("Saving features into cached file %s", cached_features_file) + with open(cached_features_file, 'wb') as handle: + pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, item): + return torch.tensor(self.examples[item]) + + +def load_and_cache_examples(args, tokenizer, evaluate=False): + dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size) + return dataset + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def mask_tokens(inputs, tokenizer, args): + """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
""" + labels = inputs.clone() + # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa) + + masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).to(torch.uint8) + labels[masked_indices==1] = -1 # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).to(torch.uint8) & masked_indices + inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) + + # 10% of the time, we replace masked input tokens with random word + indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).to(torch.uint8) & masked_indices & ~indices_replaced + random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels + + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproducibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch) + inputs = inputs.to(args.device) + labels = labels.to(args.device) + model.train() + + # pdb.set_trace() + outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + else: + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + optimizer.step() + scheduler.step() # Update learning rate schedule + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, prefix=""): + # Loop to handle MNLI double evaluation (matched, mis-matched) + eval_output_dir = args.output_dir + + eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + 
os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + model.eval() + + for batch in tqdm(eval_dataloader, desc="Evaluating"): + batch = batch.to(args.device) + + with torch.no_grad(): + outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch) + lm_loss = outputs[0] + eval_loss += lm_loss.mean().item() + nb_eval_steps += 1 + + eval_loss = eval_loss / nb_eval_steps + perplexity = torch.exp(torch.tensor(eval_loss)) + + result = { + "perplexity": perplexity + } + + output_eval_file = os.path.join(eval_output_dir, "eval_results.txt") + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(prefix)) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + + return result + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_data_file", default=None, type=str, required=True, + help="The input training data file (a text file).") + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--eval_data_file", default=None, type=str, + help="An optional input evaluation data file to evaluate the perplexity on (a text file).") + + parser.add_argument("--model_type", default="bert", type=str, + help="The model architecture to be fine-tuned.") + parser.add_argument("--model_name_or_path", default="bert-base-cased", type=str, + help="The model checkpoint for weights initialization.") + + parser.add_argument("--mlm", action='store_true', + help="Train with masked-language modeling loss instead of language modeling.") + parser.add_argument("--mlm_probability", type=float, default=0.15, + help="Ratio of tokens to mask for masked language modeling loss") + + parser.add_argument("--config_name", default="", type=str, + help="Optional pretrained config name or path if not the same as model_name_or_path") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") + parser.add_argument("--cache_dir", default="", type=str, + help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)") + parser.add_argument("--block_size", default=-1, type=int, + help="Optional input sequence length after tokenization." + "The training dataset will be truncated in block of this size for training." 
+ "Default to the model max input length for single sentence inputs (take into account special tokens).") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Run evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=1, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=1.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--logging_steps', type=int, default=100, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=100, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if args.model_type in ["bert", "roberta"] and not args.mlm: + raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. 
They must be run using the --mlm " + "flag (masked language modeling).") + if args.eval_data_file is None and args.do_eval: + raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file " + "or remove the --do_eval argument.") + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab + + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + if args.block_size <= 0: + args.block_size = tokenizer.max_len_single_sentence # Our input block size will be the max possible for the model + args.block_size = min(args.block_size, tokenizer.max_len_single_sentence) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + model.to(args.device) + + if args.local_rank == 0: + torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab + + logger.info("Training/evaluation parameters %s", args) + + # Training + if args.do_train: + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache + + train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False) + + if args.local_rank == 0: + torch.distributed.barrier() + + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use save_pretrained for the model and 
tokenizer, you can reload them using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. + # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + model.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/run_multiple_choice.py b/Optimus/code/examples/run_multiple_choice.py new file mode 100755 index 0000000000000000000000000000000000000000..05f9a48f502ddd36d17aa9a9a12462da5fa4012e --- /dev/null +++ b/Optimus/code/examples/run_multiple_choice.py @@ -0,0 +1,542 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
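+
+# Example usage (illustrative only; the task name, data path and sequence length below are
+# placeholders). The available multiple-choice tasks are those defined by the processors in
+# utils_multiple_choice.py:
+#
+#   python examples/run_multiple_choice.py \
+#     --model_type bert --model_name_or_path bert-base-uncased --do_lower_case \
+#     --task_name swag --data_dir <path/to/swag> --output_dir output/swag \
+#     --do_train --do_eval --max_seq_length 80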
+""" Finetuning the library models for multiple choice (Bert, Roberta, XLNet).""" + +from __future__ import absolute_import, division, print_function + +import argparse +import glob +import logging +import os +import random + + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tensorboardX import SummaryWriter +from tqdm import tqdm, trange + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForMultipleChoice, BertTokenizer, + XLNetConfig, XLNetForMultipleChoice, + XLNetTokenizer, RobertaConfig, + RobertaForMultipleChoice, RobertaTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +from utils_multiple_choice import (convert_examples_to_features, processors) + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, RobertaConfig)), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForMultipleChoice, BertTokenizer), + 'xlnet': (XLNetConfig, XLNetForMultipleChoice, XLNetTokenizer), + 'roberta': (RobertaConfig, RobertaForMultipleChoice, RobertaTokenizer) +} + +def select_field(features, field): + return [ + [ + choice[field] + for choice in feature.choices_features + ] + for feature in features + ] + + +def simple_accuracy(preds, labels): + return (preds == labels).mean() + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + 
# Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + best_dev_acc, best_dev_loss = 0.0, 99999999999.0 + best_steps = 0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproductibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + batch = tuple(t.to(args.device) for t in batch) + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + + optimizer.step() + scheduler.step() # Update learning rate schedule + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + if results["eval_acc"] > best_dev_acc: + best_dev_acc = results["eval_acc"] + best_dev_loss = results["eval_loss"] + best_steps = global_step + if args.do_test: + results_test = evaluate(args, model, tokenizer, test=True) + for key, value in results_test.items(): + tb_writer.add_scalar('test_{}'.format(key), value, global_step) + logger.info("test acc: %s, loss: %s, global steps: %s", str(results_test['eval_acc']), str(results_test['eval_loss']), str(global_step)) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logger.info("Average loss: %s at global step: %s", str((tr_loss - logging_loss)/args.logging_steps), str(global_step)) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not 
os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + tokenizer.save_vocabulary(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step, best_steps + + +def evaluate(args, model, tokenizer, prefix="", test=False): + eval_task_names = (args.task_name,) + eval_outputs_dirs = (args.output_dir,) + + results = {} + for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs): + eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=not test, test=test) + + if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]: + os.makedirs(eval_output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset) + eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + for batch in tqdm(eval_dataloader, desc="Evaluating"): + model.eval() + batch = tuple(t.to(args.device) for t in batch) + + with torch.no_grad(): + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM don't use segment_ids + 'labels': batch[3]} + outputs = model(**inputs) + tmp_eval_loss, logits = outputs[:2] + + eval_loss += tmp_eval_loss.mean().item() + nb_eval_steps += 1 + if preds is None: + preds = logits.detach().cpu().numpy() + out_label_ids = inputs['labels'].detach().cpu().numpy() + else: + preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) + out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) + + eval_loss = eval_loss / nb_eval_steps + preds = np.argmax(preds, axis=1) + acc = simple_accuracy(preds, out_label_ids) + result = {"eval_acc": acc, "eval_loss": eval_loss} + results.update(result) + + output_eval_file = os.path.join(eval_output_dir, "is_test_" + str(test).lower() + "_eval_results.txt") + + with open(output_eval_file, "w") as writer: + logger.info("***** Eval results {} *****".format(str(prefix) + " is test:" + str(test))) + writer.write("model =%s\n" % str(args.model_name_or_path)) + writer.write("total batch size=%d\n" % (args.per_gpu_train_batch_size * args.gradient_accumulation_steps * + (torch.distributed.get_world_size() if args.local_rank != -1 else 1))) + writer.write("train num epochs=%d\n" % args.num_train_epochs) + writer.write("fp16 =%s\n" % args.fp16) + writer.write("max seq length =%d\n" % args.max_seq_length) + for key in sorted(result.keys()): + logger.info(" %s = %s", key, str(result[key])) + writer.write("%s = %s\n" % (key, str(result[key]))) + return results + + +def load_and_cache_examples(args, task, tokenizer, 
evaluate=False, test=False): + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + processor = processors[task]() + # Load data features from cache or dataset file + if evaluate: + cached_mode = 'dev' + elif test: + cached_mode = 'test' + else: + cached_mode = 'train' + assert (evaluate == True and test == True) == False + cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format( + cached_mode, + list(filter(None, args.model_name_or_path.split('/'))).pop(), + str(args.max_seq_length), + str(task))) + if os.path.exists(cached_features_file): + logger.info("Loading features from cached file %s", cached_features_file) + features = torch.load(cached_features_file) + else: + logger.info("Creating features from dataset file at %s", args.data_dir) + label_list = processor.get_labels() + if evaluate: + examples = processor.get_dev_examples(args.data_dir) + elif test: + examples = processor.get_test_examples(args.data_dir) + else: + examples = processor.get_train_examples(args.data_dir) + logger.info("Training number: %s", str(len(examples))) + features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer, + cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end + cls_token=tokenizer.cls_token, + sep_token=tokenizer.sep_token, + sep_token_extra=bool(args.model_type in ['roberta']), + cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, + pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet + pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0) + if args.local_rank in [-1, 0]: + logger.info("Saving features into cached file %s", cached_features_file) + torch.save(features, cached_features_file) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Convert to Tensors and build dataset + all_input_ids = torch.tensor(select_field(features, 'input_ids'), dtype=torch.long) + all_input_mask = torch.tensor(select_field(features, 'input_mask'), dtype=torch.long) + all_segment_ids = torch.tensor(select_field(features, 'segment_ids'), dtype=torch.long) + all_label_ids = torch.tensor([f.label for f in features], dtype=torch.long) + + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) + return dataset + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--data_dir", default=None, type=str, required=True, + help="The input data dir. 
Should contain the .tsv files (or other data files) for the task.") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--task_name", default=None, type=str, required=True, + help="The name of the task to train selected in the list: " + ", ".join(processors.keys())) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model predictions and checkpoints will be written.") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--do_test", action='store_true', help='Whether to run test on the test set') + parser.add_argument("--evaluate_during_training", action='store_true', + help="Rul evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Avoid using CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument("--local_rank", type=int, default=-1, + help="For distributed training: local_rank") + parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="For distant debugging.") + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Prepare GLUE task + args.task_name = args.task_name.lower() + if args.task_name not in processors: + raise ValueError("Task not found: %s" % (args.task_name)) + processor = processors[args.task_name]() + label_list = processor.get_labels() + num_labels = len(label_list) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + model.to(args.device) + + logger.info("Training/evaluation parameters %s", args) + best_steps = 0 + + # Training + if args.do_train: + train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False) + global_step, tr_loss, best_steps = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir) + model.to(args.device) + + + # Evaluation + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + if not args.do_train: + args.output_dir = args.model_name_or_path + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + + if args.do_test and args.local_rank in [-1, 0]: + if not args.do_train: + args.output_dir = args.model_name_or_path + checkpoints = [args.output_dir] + # if args.eval_all_checkpoints: # can not use this to do test!! + # checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + # logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging + logger.info("Evaluate the following checkpoints: %s", checkpoints) + for checkpoint in checkpoints: + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + result = evaluate(args, model, tokenizer, prefix=global_step, test=True) + result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) + results.update(result) + if best_steps: + logger.info("best steps of eval acc is the following checkpoints: %s", best_steps) + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/run_squad.py b/Optimus/code/examples/run_squad.py new file mode 100755 index 0000000000000000000000000000000000000000..cc4eda306ccde2508d808a8b4a0b1b50aad47a37 --- /dev/null +++ b/Optimus/code/examples/run_squad.py @@ -0,0 +1,533 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" Finetuning the library models for question-answering on SQuAD (Bert, XLM, XLNet).""" + +from __future__ import absolute_import, division, print_function + +import argparse +import logging +import os +import random +import glob + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + +from tensorboardX import SummaryWriter + +from pytorch_transformers import (WEIGHTS_NAME, BertConfig, + BertForQuestionAnswering, BertTokenizer, + XLMConfig, XLMForQuestionAnswering, + XLMTokenizer, XLNetConfig, + XLNetForQuestionAnswering, + XLNetTokenizer) + +from pytorch_transformers import AdamW, WarmupLinearSchedule + +from utils_squad import (read_squad_examples, convert_examples_to_features, + RawResult, write_predictions, + RawResultExtended, write_predictions_extended) + +# The follwing import is the official SQuAD evaluation script (2.0). +# You can remove it from the dependencies if you are using this script outside of the library +# We've added it here for automated tests (see examples/test_examples.py file) +from utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad + +logger = logging.getLogger(__name__) + +ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \ + for conf in (BertConfig, XLNetConfig, XLMConfig)), ()) + +MODEL_CLASSES = { + 'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer), + 'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer), + 'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer), +} + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if args.n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + +def to_list(tensor): + return tensor.detach().cpu().tolist() + +def train(args, train_dataset, model, tokenizer): + """ Train the model """ + if args.local_rank in [-1, 0]: + tb_writer = SummaryWriter() + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) + train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) + + if args.max_steps > 0: + t_total = args.max_steps + args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 + else: + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # Prepare optimizer and schedule (linear warmup and decay) + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) + scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) + if args.fp16: + try: + from apex import amp + except ImportError: + raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.") + model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) + + # multi-gpu training (should be after apex fp16 initialization) + if args.n_gpu > 1: + model = torch.nn.DataParallel(model) + + # 
Distributed training (should be after apex fp16 initialization) + if args.local_rank != -1: + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], + output_device=args.local_rank, + find_unused_parameters=True) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1)) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss, logging_loss = 0.0, 0.0 + model.zero_grad() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) # Added here for reproductibility (even between python 2 and 3) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + batch = tuple(t.to(args.device) for t in batch) + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': None if args.model_type == 'xlm' else batch[2], + 'start_positions': batch[3], + 'end_positions': batch[4]} + if args.model_type in ['xlnet', 'xlm']: + inputs.update({'cls_index': batch[5], + 'p_mask': batch[6]}) + outputs = model(**inputs) + loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + if args.n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + with amp.scale_loss(loss, optimizer) as scaled_loss: + scaled_loss.backward() + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + optimizer.step() + scheduler.step() # Update learning rate schedule + model.zero_grad() + global_step += 1 + + if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0: + # Log metrics + if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well + results = evaluate(args, model, tokenizer) + for key, value in results.items(): + tb_writer.add_scalar('eval_{}'.format(key), value, global_step) + tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) + tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step) + logging_loss = tr_loss + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(output_dir) + torch.save(args, os.path.join(output_dir, 'training_args.bin')) + 
logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + if args.local_rank in [-1, 0]: + tb_writer.close() + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, prefix=""): + dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True) + + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) + # Note that DistributedSampler samples randomly + eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset) + eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) + + # Eval! + logger.info("***** Running evaluation {} *****".format(prefix)) + logger.info(" Num examples = %d", len(dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + all_results = [] + for batch in tqdm(eval_dataloader, desc="Evaluating"): + model.eval() + batch = tuple(t.to(args.device) for t in batch) + with torch.no_grad(): + inputs = {'input_ids': batch[0], + 'attention_mask': batch[1], + 'token_type_ids': None if args.model_type == 'xlm' else batch[2] # XLM don't use segment_ids + } + example_indices = batch[3] + if args.model_type in ['xlnet', 'xlm']: + inputs.update({'cls_index': batch[4], + 'p_mask': batch[5]}) + outputs = model(**inputs) + + for i, example_index in enumerate(example_indices): + eval_feature = features[example_index.item()] + unique_id = int(eval_feature.unique_id) + if args.model_type in ['xlnet', 'xlm']: + # XLNet uses a more complex post-processing procedure + result = RawResultExtended(unique_id = unique_id, + start_top_log_probs = to_list(outputs[0][i]), + start_top_index = to_list(outputs[1][i]), + end_top_log_probs = to_list(outputs[2][i]), + end_top_index = to_list(outputs[3][i]), + cls_logits = to_list(outputs[4][i])) + else: + result = RawResult(unique_id = unique_id, + start_logits = to_list(outputs[0][i]), + end_logits = to_list(outputs[1][i])) + all_results.append(result) + + # Compute predictions + output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix)) + output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix)) + if args.version_2_with_negative: + output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix)) + else: + output_null_log_odds_file = None + + if args.model_type in ['xlnet', 'xlm']: + # XLNet uses a more complex post-processing procedure + write_predictions_extended(examples, features, all_results, args.n_best_size, + args.max_answer_length, output_prediction_file, + output_nbest_file, output_null_log_odds_file, args.predict_file, + model.config.start_n_top, model.config.end_n_top, + args.version_2_with_negative, tokenizer, args.verbose_logging) + else: + write_predictions(examples, features, all_results, args.n_best_size, + args.max_answer_length, args.do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, args.verbose_logging, + args.version_2_with_negative, args.null_score_diff_threshold) + + # Evaluate with the official SQuAD script + evaluate_options = EVAL_OPTS(data_file=args.predict_file, + pred_file=output_prediction_file, + 
na_prob_file=output_null_log_odds_file) + results = evaluate_on_squad(evaluate_options) + return results + + +def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False): + if args.local_rank not in [-1, 0] and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Load data features from cache or dataset file + input_file = args.predict_file if evaluate else args.train_file + cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format( + 'dev' if evaluate else 'train', + list(filter(None, args.model_name_or_path.split('/'))).pop(), + str(args.max_seq_length))) + if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples: + logger.info("Loading features from cached file %s", cached_features_file) + features = torch.load(cached_features_file) + else: + logger.info("Creating features from dataset file at %s", input_file) + examples = read_squad_examples(input_file=input_file, + is_training=not evaluate, + version_2_with_negative=args.version_2_with_negative) + features = convert_examples_to_features(examples=examples, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + doc_stride=args.doc_stride, + max_query_length=args.max_query_length, + is_training=not evaluate) + if args.local_rank in [-1, 0]: + logger.info("Saving features into cached file %s", cached_features_file) + torch.save(features, cached_features_file) + + if args.local_rank == 0 and not evaluate: + torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache + + # Convert to Tensors and build dataset + all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) + all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) + all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) + all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long) + all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float) + if evaluate: + all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long) + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, + all_example_index, all_cls_index, all_p_mask) + else: + all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long) + all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long) + dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, + all_start_positions, all_end_positions, + all_cls_index, all_p_mask) + + if output_examples: + return dataset, examples, features + return dataset + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument("--train_file", default=None, type=str, required=True, + help="SQuAD json for training. E.g., train-v1.1.json") + parser.add_argument("--predict_file", default=None, type=str, required=True, + help="SQuAD json for predictions. 
E.g., dev-v1.1.json or test-v1.1.json") + parser.add_argument("--model_type", default=None, type=str, required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS)) + parser.add_argument("--output_dir", default=None, type=str, required=True, + help="The output directory where the model checkpoints and predictions will be written.") + + ## Other parameters + parser.add_argument("--config_name", default="", type=str, + help="Pretrained config name or path if not the same as model_name") + parser.add_argument("--tokenizer_name", default="", type=str, + help="Pretrained tokenizer name or path if not the same as model_name") + parser.add_argument("--cache_dir", default="", type=str, + help="Where do you want to store the pre-trained models downloaded from s3") + + parser.add_argument('--version_2_with_negative', action='store_true', + help='If true, the SQuAD examples contain some that do not have an answer.') + parser.add_argument('--null_score_diff_threshold', type=float, default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.") + + parser.add_argument("--max_seq_length", default=384, type=int, + help="The maximum total input sequence length after WordPiece tokenization. Sequences " + "longer than this will be truncated, and sequences shorter than this will be padded.") + parser.add_argument("--doc_stride", default=128, type=int, + help="When splitting up a long document into chunks, how much stride to take between chunks.") + parser.add_argument("--max_query_length", default=64, type=int, + help="The maximum number of tokens for the question. Questions longer than this will " + "be truncated to this length.") + parser.add_argument("--do_train", action='store_true', + help="Whether to run training.") + parser.add_argument("--do_eval", action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--evaluate_during_training", action='store_true', + help="Rul evaluation during training at each logging step.") + parser.add_argument("--do_lower_case", action='store_true', + help="Set this flag if you are using an uncased model.") + + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, + help="Batch size per GPU/CPU for evaluation.") + parser.add_argument("--learning_rate", default=5e-5, type=float, + help="The initial learning rate for Adam.") + parser.add_argument('--gradient_accumulation_steps', type=int, default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.") + parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight deay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, + help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, + help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3.0, type=float, + help="Total number of training epochs to perform.") + parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") + parser.add_argument("--warmup_steps", default=0, type=int, + help="Linear warmup over warmup_steps.") + parser.add_argument("--n_best_size", default=20, type=int, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.") + parser.add_argument("--max_answer_length", default=30, type=int, + help="The maximum length of an answer that can be generated. This is needed because the start " + "and end predictions are not conditioned on one another.") + parser.add_argument("--verbose_logging", action='store_true', + help="If true, all of the warnings related to data processing will be printed. " + "A number of warnings are expected for a normal SQuAD evaluation.") + + parser.add_argument('--logging_steps', type=int, default=50, + help="Log every X updates steps.") + parser.add_argument('--save_steps', type=int, default=50, + help="Save checkpoint every X updates steps.") + parser.add_argument("--eval_all_checkpoints", action='store_true', + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number") + parser.add_argument("--no_cuda", action='store_true', + help="Whether not to use CUDA when available") + parser.add_argument('--overwrite_output_dir', action='store_true', + help="Overwrite the content of the output directory") + parser.add_argument('--overwrite_cache', action='store_true', + help="Overwrite the cached training and evaluation sets") + parser.add_argument('--seed', type=int, default=42, + help="random seed for initialization") + + parser.add_argument("--local_rank", type=int, default=-1, + help="local_rank for distributed training on gpus") + parser.add_argument('--fp16', action='store_true', + help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") + parser.add_argument('--fp16_opt_level', type=str, default='O1', + help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://nvidia.github.io/apex/amp.html") + parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") + parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") + args = parser.parse_args() + + if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: + raise ValueError("Output directory ({}) already exists and is not empty. 
Use --overwrite_output_dir to overcome.".format(args.output_dir)) + + # Setup distant debugging if needed + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) + ptvsd.wait_for_attach() + + # Setup CUDA, GPU & distributed training + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") + args.n_gpu = torch.cuda.device_count() + else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + torch.distributed.init_process_group(backend='nccl') + args.n_gpu = 1 + args.device = device + + # Setup logging + logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt = '%m/%d/%Y %H:%M:%S', + level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s", + args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16) + + # Set seed + set_seed(args) + + # Load pretrained model and tokenizer + if args.local_rank not in [-1, 0]: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + args.model_type = args.model_type.lower() + config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) + model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) + + if args.local_rank == 0: + torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab + + model.to(args.device) + + logger.info("Training/evaluation parameters %s", args) + + # Training + if args.do_train: + train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False) + global_step, tr_loss = train(args, train_dataset, model, tokenizer) + logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) + + + # Save the trained model and the tokenizer + if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): + # Create output directory if needed + if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]: + os.makedirs(args.output_dir) + + logger.info("Saving model checkpoint to %s", args.output_dir) + # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
+ # They can then be reloaded using `from_pretrained()` + model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + + # Good practice: save your training arguments together with the trained model + torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) + + # Load a trained model and vocabulary that you have fine-tuned + model = model_class.from_pretrained(args.output_dir) + tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) + model.to(args.device) + + + # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory + results = {} + if args.do_eval and args.local_rank in [-1, 0]: + checkpoints = [args.output_dir] + if args.eval_all_checkpoints: + checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) + logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs + + logger.info("Evaluate the following checkpoints: %s", checkpoints) + + for checkpoint in checkpoints: + # Reload the model + global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" + model = model_class.from_pretrained(checkpoint) + model.to(args.device) + + # Evaluate + result = evaluate(args, model, tokenizer, prefix=global_step) + + result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items()) + results.update(result) + + logger.info("Results: {}".format(results)) + + return results + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/examples/test_examples.py b/Optimus/code/examples/test_examples.py new file mode 100755 index 0000000000000000000000000000000000000000..b04d722b7b08f61db8bbdbd7769a6d9541b205ad --- /dev/null +++ b/Optimus/code/examples/test_examples.py @@ -0,0 +1,111 @@ +# coding=utf-8 +# Copyright 2018 HuggingFace Inc.. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import sys +import unittest +import argparse +import logging + +try: + # python 3.4+ can use builtin unittest.mock instead of mock package + from unittest.mock import patch +except ImportError: + from mock import patch + +import run_glue +import run_squad +import run_generation + +logging.basicConfig(level=logging.DEBUG) + +logger = logging.getLogger() + +def get_setup_file(): + parser = argparse.ArgumentParser() + parser.add_argument('-f') + args = parser.parse_args() + return args.f + +class ExamplesTests(unittest.TestCase): + + def test_run_glue(self): + stream_handler = logging.StreamHandler(sys.stdout) + logger.addHandler(stream_handler) + + testargs = ["run_glue.py", + "--data_dir=./examples/tests_samples/MRPC/", + "--task_name=mrpc", + "--do_train", + "--do_eval", + "--output_dir=./examples/tests_samples/temp_dir", + "--per_gpu_train_batch_size=2", + "--per_gpu_eval_batch_size=1", + "--learning_rate=1e-4", + "--max_steps=10", + "--warmup_steps=2", + "--overwrite_output_dir", + "--seed=42"] + model_type, model_name = ("--model_type=bert", + "--model_name_or_path=bert-base-uncased") + with patch.object(sys, 'argv', testargs + [model_type, model_name]): + result = run_glue.main() + for value in result.values(): + self.assertGreaterEqual(value, 0.75) + + def test_run_squad(self): + stream_handler = logging.StreamHandler(sys.stdout) + logger.addHandler(stream_handler) + + testargs = ["run_squad.py", + "--train_file=./examples/tests_samples/SQUAD/dev-v2.0-small.json", + "--predict_file=./examples/tests_samples/SQUAD/dev-v2.0-small.json", + "--model_name=bert-base-uncased", + "--output_dir=./examples/tests_samples/temp_dir", + "--max_steps=10", + "--warmup_steps=2", + "--do_train", + "--do_eval", + "--version_2_with_negative", + "--learning_rate=2e-4", + "--per_gpu_train_batch_size=2", + "--per_gpu_eval_batch_size=1", + "--overwrite_output_dir", + "--seed=42"] + model_type, model_name = ("--model_type=bert", + "--model_name_or_path=bert-base-uncased") + with patch.object(sys, 'argv', testargs + [model_type, model_name]): + result = run_squad.main() + self.assertGreaterEqual(result['f1'], 30) + self.assertGreaterEqual(result['exact'], 30) + + def test_generation(self): + stream_handler = logging.StreamHandler(sys.stdout) + logger.addHandler(stream_handler) + + testargs = ["run_generation.py", + "--prompt=Hello", + "--length=10", + "--seed=42"] + model_type, model_name = ("--model_type=openai-gpt", + "--model_name_or_path=openai-gpt") + with patch.object(sys, 'argv', testargs + [model_type, model_name]): + result = run_generation.main() + self.assertGreaterEqual(len(result), 10) + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/examples/tests_samples/.gitignore b/Optimus/code/examples/tests_samples/.gitignore new file mode 100755 index 0000000000000000000000000000000000000000..c8ce21fe2411c3dc3022e26ccf4e11cc6b58a01d --- /dev/null +++ b/Optimus/code/examples/tests_samples/.gitignore @@ -0,0 +1,6 @@ +*.* +cache* +temp* +!*.tsv +!*.json +!.gitignore \ No newline at end of file diff --git a/Optimus/code/examples/tests_samples/MRPC/dev.tsv b/Optimus/code/examples/tests_samples/MRPC/dev.tsv new file mode 100755 index 0000000000000000000000000000000000000000..5b814856c63f44ef8c082726ae19285a4faec26c --- /dev/null +++ b/Optimus/code/examples/tests_samples/MRPC/dev.tsv @@ -0,0 +1,7 @@ +Quality #1 ID #2 ID #1 String #2 String +1 1355540 1355592 He said the 
foodservice pie business doesn 't fit the company 's long-term growth strategy . " The foodservice pie business does not fit our long-term growth strategy . +0 2029631 2029565 Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war . His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war . +0 487993 487952 The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat . The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent . +1 1989515 1989458 The AFL-CIO is waiting until October to decide if it will endorse a candidate . The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries . +0 1783137 1782659 No dates have been set for the civil or the criminal trial . No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty . +1 3039165 3039036 Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed . It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status . diff --git a/Optimus/code/examples/tests_samples/MRPC/train.tsv b/Optimus/code/examples/tests_samples/MRPC/train.tsv new file mode 100755 index 0000000000000000000000000000000000000000..5b814856c63f44ef8c082726ae19285a4faec26c --- /dev/null +++ b/Optimus/code/examples/tests_samples/MRPC/train.tsv @@ -0,0 +1,7 @@ +Quality #1 ID #2 ID #1 String #2 String +1 1355540 1355592 He said the foodservice pie business doesn 't fit the company 's long-term growth strategy . " The foodservice pie business does not fit our long-term growth strategy . +0 2029631 2029565 Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war . His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war . +0 487993 487952 The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat . The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent . +1 1989515 1989458 The AFL-CIO is waiting until October to decide if it will endorse a candidate . The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries . +0 1783137 1782659 No dates have been set for the civil or the criminal trial . No dates have been set for the criminal or civil cases , but Shanley has pleaded not guilty . +1 3039165 3039036 Wal-Mart said it would check all of its million-plus domestic workers to ensure they were legally employed . It has also said it would review all of its domestic employees more than 1 million to ensure they have legal status . 
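
For reference, here is a minimal sketch of how rows in these MRPC sample files map onto `InputExample` fields, mirroring `MrpcProcessor._create_examples` in `utils_glue.py` (skip the header row, take the label from the `Quality` column and the sentence pair from the `#1 String` and `#2 String` columns). The file path and the helper name below are illustrative assumptions, not part of the diff:

```
# Sketch only: read an MRPC-style train.tsv/dev.tsv the same way the
# MrpcProcessor does (tab-delimited, quoting disabled, header row skipped).
import csv

def read_mrpc_tsv(path):
    """Yield (label, text_a, text_b) tuples from an MRPC-style TSV file."""
    with open(path, "r", encoding="utf-8-sig") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if i == 0:
                continue  # header: Quality, #1 ID, #2 ID, #1 String, #2 String
            yield row[0], row[3], row[4]  # label, text_a, text_b

if __name__ == "__main__":
    # Path assumed to match the sample fixture added in this diff.
    for label, text_a, text_b in read_mrpc_tsv("examples/tests_samples/MRPC/train.tsv"):
        print(label, "|", text_a[:40], "|", text_b[:40])
```
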
diff --git a/Optimus/code/examples/tests_samples/SQUAD/dev-v2.0-small.json b/Optimus/code/examples/tests_samples/SQUAD/dev-v2.0-small.json new file mode 100755 index 0000000000000000000000000000000000000000..834d9ee6602b300ea45c67212800b0bbf6d1129e --- /dev/null +++ b/Optimus/code/examples/tests_samples/SQUAD/dev-v2.0-small.json @@ -0,0 +1,140 @@ +{ + "version": "v2.0", + "data": [{ + "title": "Normans", + "paragraphs": [{ + "qas": [{ + "question": "In what country is Normandy located?", + "id": "56ddde6b9a695914005b9628", + "answers": [{ + "text": "France", + "answer_start": 159 + }], + "is_impossible": false + }, { + "question": "When were the Normans in Normandy?", + "id": "56ddde6b9a695914005b9629", + "answers": [{ + "text": "10th and 11th centuries", + "answer_start": 94 + }], + "is_impossible": false + }, { + "question": "From which countries did the Norse originate?", + "id": "56ddde6b9a695914005b962a", + "answers": [{ + "text": "Denmark, Iceland and Norway", + "answer_start": 256 + }], + "is_impossible": false + }, { + "plausible_answers": [{ + "text": "Rollo", + "answer_start": 308 + }], + "question": "Who did King Charles III swear fealty to?", + "id": "5ad39d53604f3c001a3fe8d3", + "answers": [], + "is_impossible": true + }, { + "plausible_answers": [{ + "text": "10th century", + "answer_start": 671 + }], + "question": "When did the Frankish identity emerge?", + "id": "5ad39d53604f3c001a3fe8d4", + "answers": [], + "is_impossible": true + }], + "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries." + }, { + "qas": [{ + "question": "Who was the duke in the battle of Hastings?", + "id": "56dddf4066d3e219004dad5f", + "answers": [{ + "text": "William the Conqueror", + "answer_start": 1022 + }], + "is_impossible": false + }, { + "plausible_answers": [{ + "text": "Antioch", + "answer_start": 1295 + }], + "question": "What principality did William the conquerer found?", + "id": "5ad3a266604f3c001a3fea2b", + "answers": [], + "is_impossible": true + }], + "context": "The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. 
The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands." + }] + }, { + "title": "Computational_complexity_theory", + "paragraphs": [{ + "qas": [{ + "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?", + "id": "56e16182e3433e1400422e28", + "answers": [{ + "text": "Computational complexity theory", + "answer_start": 0 + }], + "is_impossible": false + }, { + "plausible_answers": [{ + "text": "algorithm", + "answer_start": 472 + }], + "question": "What is a manual application of mathematical steps?", + "id": "5ad5316b5b96ef001a10ab76", + "answers": [], + "is_impossible": true + }], + "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm." + }, { + "qas": [{ + "question": "What measure of a computational problem broadly defines the inherent difficulty of the solution?", + "id": "56e16839cd28a01900c67887", + "answers": [{ + "text": "if its solution requires significant resources", + "answer_start": 46 + }], + "is_impossible": false + }, { + "question": "What method is used to intuitively assess or quantify the amount of resources required to solve a computational problem?", + "id": "56e16839cd28a01900c67888", + "answers": [{ + "text": "mathematical models of computation", + "answer_start": 176 + }], + "is_impossible": false + }, { + "question": "What are two basic primary resources used to guage complexity?", + "id": "56e16839cd28a01900c67889", + "answers": [{ + "text": "time and storage", + "answer_start": 305 + }], + "is_impossible": false + }, { + "plausible_answers": [{ + "text": "the number of gates in a circuit", + "answer_start": 436 + }], + "question": "What unit is measured to determine circuit simplicity?", + "id": "5ad532575b96ef001a10ab7f", + "answers": [], + "is_impossible": true + }, { + "plausible_answers": [{ + "text": "the number of processors", + "answer_start": 502 + }], + "question": "What number is used in perpendicular computing?", + "id": "5ad532575b96ef001a10ab80", + "answers": [], + "is_impossible": true + }], + "context": "A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. 
Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do." + }] + }] +} \ No newline at end of file diff --git a/Optimus/code/examples/utils_glue.py b/Optimus/code/examples/utils_glue.py new file mode 100755 index 0000000000000000000000000000000000000000..4fad8f47c589b3d49df4bb2b2eda96322f425541 --- /dev/null +++ b/Optimus/code/examples/utils_glue.py @@ -0,0 +1,703 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" BERT classification fine-tuning: utilities to work with GLUE tasks """ + +from __future__ import absolute_import, division, print_function + +import csv +import logging +import os +import sys +from io import open +from collections import defaultdict +import numpy as np +import pdb + +from scipy.stats import pearsonr, spearmanr +from sklearn.metrics import matthews_corrcoef, f1_score + + +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. 
+ """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def get_train_examples(self, data_dir): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8-sig") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + if sys.version_info[0] == 2: + line = list(unicode(cell, 'utf-8') for cell in line) + lines.append(line) + return lines + + +class MrpcProcessor(DataProcessor): + """Processor for the MRPC data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv"))) + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = line[3] + text_b = line[4] + label = line[0] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class MnliProcessor(DataProcessor): + """Processor for the MultiNLI data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), + "dev_matched") + + def get_labels(self): + """See base class.""" + return ["contradiction", "entailment", "neutral"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + text_a = line[8] + text_b = line[9] + label = line[-1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class MnliMismatchedProcessor(MnliProcessor): + """Processor for the MultiNLI Mismatched data set (GLUE version).""" + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), + "dev_matched") + + +class ColaProcessor(DataProcessor): + """Processor 
for the CoLA data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train", percentage_per_label) + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type, percentage_per_label=1.0, sample_per_label=0): + """Creates examples for the training and dev sets.""" + dict_label2examples = defaultdict(list) + examples = [] + for (i, line) in enumerate(lines): + guid = "%s-%s" % (set_type, i) + text_a = line[3] + label = line[1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) + dict_label2examples[label].append(i) + + + if percentage_per_label<1.0: + nlabel = GLUE_TASKS_NUM_LABELS['cola'] + examples_sub = [] + for i in range(nlabel): + index = np.random.choice(dict_label2examples[str(i)], int(len(dict_label2examples[str(i)])*percentage_per_label), replace=False) + for j in index: + examples_sub.append(examples[j]) + examples = examples_sub + + # pdb.set_trace() + return examples + + +class YelpProcessor(DataProcessor): + """Processor for the Yelp short data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train", percentage_per_label, sample_per_label) + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test.tsv")), "test", percentage_per_label=1.0, sample_per_label=5000) + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type, percentage_per_label=1.0, sample_per_label=0): + """Creates examples for the training and dev sets.""" + dict_label2examples = defaultdict(list) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = line[1] + label = line[0] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) + dict_label2examples[label].append(i-1) + + + if percentage_per_label<1.0 or sample_per_label>0: + nlabel = GLUE_TASKS_NUM_LABELS['yelp'] + examples_sub = [] + for i in range(nlabel): + if sample_per_label > 0: + index = np.random.choice(dict_label2examples[str(i)], sample_per_label, replace=False) + else: + index = np.random.choice(dict_label2examples[str(i)], int(len(dict_label2examples[str(i)])*percentage_per_label), replace=False) + + for j in index: + examples_sub.append(examples[j]) + examples = examples_sub + + # pdb.set_trace() + return examples + + + +class Sst2Processor(DataProcessor): + """Processor for the SST-2 data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train", percentage_per_label) + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type, 
percentage_per_label=1.0): + """Creates examples for the training and dev sets.""" + dict_label2examples = defaultdict(list) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = line[0] + label = line[1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) + dict_label2examples[label].append(i-1) + + if percentage_per_label<1.0: + nlabel = GLUE_TASKS_NUM_LABELS['sst-2'] + examples_sub = [] + for i in range(nlabel): + index = np.random.choice(dict_label2examples[str(i)], int(len(dict_label2examples[str(i)])*percentage_per_label), replace=False) + for j in index: + examples_sub.append(examples[j]) + examples = examples_sub + + + return examples + + +class StsbProcessor(DataProcessor): + """Processor for the STS-B data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return [None] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + text_a = line[7] + text_b = line[8] + label = line[-1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class QqpProcessor(DataProcessor): + """Processor for the QQP data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + try: + text_a = line[3] + text_b = line[4] + label = line[5] + except IndexError: + continue + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class QnliProcessor(DataProcessor): + """Processor for the QNLI data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), + "dev_matched") + + def get_labels(self): + """See base class.""" + return ["entailment", "not_entailment"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + text_a = line[1] + text_b = line[2] + label = line[-1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, 
label=label)) + return examples + + +class RteProcessor(DataProcessor): + """Processor for the RTE data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["entailment", "not_entailment"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + text_a = line[1] + text_b = line[2] + label = line[-1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class WnliProcessor(DataProcessor): + """Processor for the WNLI data set (GLUE version).""" + + def get_train_examples(self, data_dir, percentage_per_label=1.0, sample_per_label=0): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, line[0]) + text_a = line[1] + text_b = line[2] + label = line[-1] + examples.append( + InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +def convert_examples_to_features(examples, label_list, max_seq_length, + tokenizer, output_mode, + cls_token_at_end=False, + cls_token='[CLS]', + cls_token_segment_id=1, + sep_token='[SEP]', + sep_token_extra=False, + pad_on_left=False, + pad_token=0, + pad_token_segment_id=0, + sequence_a_segment_id=0, + sequence_b_segment_id=1, + mask_padding_with_zero=True): + """ Loads a data file into a list of `InputBatch`s + `cls_token_at_end` define the location of the CLS token: + - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP] + - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS] + `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet) + """ + + label_map = {label : i for i, label in enumerate(label_list)} + + features = [] + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + logger.info("Writing example %d of %d" % (ex_index, len(examples))) + + tokens_a = tokenizer.tokenize(example.text_a) + + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + # Modifies `tokens_a` and `tokens_b` in place so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3". " -4" for RoBERTa. + special_tokens_count = 4 if sep_token_extra else 3 + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count) + else: + # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa. 
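+            # e.g. with max_seq_length=128 and no extra separator, a single
+            # sentence is clipped to 126 tokens so that [CLS] + tokens_a + [SEP]
+            # still fits.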
+ special_tokens_count = 3 if sep_token_extra else 2 + if len(tokens_a) > max_seq_length - special_tokens_count: + tokens_a = tokens_a[:(max_seq_length - special_tokens_count)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = tokens_a + [sep_token] + if sep_token_extra: + # roberta uses an extra separator b/w pairs of sentences + tokens += [sep_token] + segment_ids = [sequence_a_segment_id] * len(tokens) + + if tokens_b: + tokens += tokens_b + [sep_token] + segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1) + + if cls_token_at_end: + tokens = tokens + [cls_token] + segment_ids = segment_ids + [cls_token_segment_id] + else: + tokens = [cls_token] + tokens + segment_ids = [cls_token_segment_id] + segment_ids + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) + + # Zero-pad up to the sequence length. + padding_length = max_seq_length - len(input_ids) + if pad_on_left: + input_ids = ([pad_token] * padding_length) + input_ids + input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask + segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids + else: + input_ids = input_ids + ([pad_token] * padding_length) + input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length) + segment_ids = segment_ids + ([pad_token_segment_id] * padding_length) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + if output_mode == "classification": + label_id = label_map[example.label] + elif output_mode == "regression": + label_id = float(example.label) + else: + raise KeyError(output_mode) + + if ex_index < 5: + logger.info("*** Example ***") + logger.info("guid: %s" % (example.guid)) + logger.info("tokens: %s" % " ".join( + [str(x) for x in tokens])) + logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) + logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) + logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) + logger.info("label: %s (id = %d)" % (example.label, label_id)) + + features.append( + InputFeatures(input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id)) + return features + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. 
This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +def simple_accuracy(preds, labels): + return (preds == labels).mean() + + +def acc_and_f1(preds, labels): + acc = simple_accuracy(preds, labels) + f1 = f1_score(y_true=labels, y_pred=preds) + return { + "acc": acc, + "f1": f1, + "acc_and_f1": (acc + f1) / 2, + } + + +def pearson_and_spearman(preds, labels): + pearson_corr = pearsonr(preds, labels)[0] + spearman_corr = spearmanr(preds, labels)[0] + return { + "pearson": pearson_corr, + "spearmanr": spearman_corr, + "corr": (pearson_corr + spearman_corr) / 2, + } + + +def compute_metrics(task_name, preds, labels): + assert len(preds) == len(labels) + if task_name == "cola": + return {"mcc": matthews_corrcoef(labels, preds)} + elif task_name == "sst-2": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "mrpc": + return acc_and_f1(preds, labels) + elif task_name == "sts-b": + return pearson_and_spearman(preds, labels) + elif task_name == "qqp": + return acc_and_f1(preds, labels) + elif task_name == "mnli": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "mnli-mm": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "qnli": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "rte": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "wnli": + return {"acc": simple_accuracy(preds, labels)} + elif task_name == "yelp": + return {"acc": simple_accuracy(preds, labels)} + else: + raise KeyError(task_name) + +processors = { + "cola": ColaProcessor, + "mnli": MnliProcessor, + "mnli-mm": MnliMismatchedProcessor, + "mrpc": MrpcProcessor, + "sst-2": Sst2Processor, + "sts-b": StsbProcessor, + "qqp": QqpProcessor, + "qnli": QnliProcessor, + "rte": RteProcessor, + "wnli": WnliProcessor, + "yelp": YelpProcessor, +} + +output_modes = { + "cola": "classification", + "mnli": "classification", + "mnli-mm": "classification", + "mrpc": "classification", + "sst-2": "classification", + "sts-b": "regression", + "qqp": "classification", + "qnli": "classification", + "rte": "classification", + "wnli": "classification", + "yelp": "classification", +} + +GLUE_TASKS_NUM_LABELS = { + "cola": 2, + "mnli": 3, + "mrpc": 2, + "sst-2": 2, + "sts-b": 1, + "qqp": 2, + "qnli": 2, + "rte": 2, + "wnli": 2, + "yelp": 2, +} diff --git a/Optimus/code/examples/utils_multiple_choice.py b/Optimus/code/examples/utils_multiple_choice.py new file mode 100755 index 0000000000000000000000000000000000000000..7abcc5e1e9ea20d63a03a831ed68cb5b71328bc7 --- /dev/null +++ b/Optimus/code/examples/utils_multiple_choice.py @@ -0,0 +1,463 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" BERT multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """ + +from __future__ import absolute_import, division, print_function + + +import logging +import os +import sys +from io import open +import json +import csv +import glob +import tqdm + + +logger = logging.getLogger(__name__) + + +class InputExample(object): + """A single training/test example for multiple choice""" + + def __init__(self, example_id, question, contexts, endings, label=None): + """Constructs a InputExample. + + Args: + example_id: Unique id for the example. + contexts: list of str. The untokenized text of the first sequence (context of corresponding question). + question: string. The untokenized text of the second sequence (qustion). + endings: list of str. multiple choice's options. Its length must be equal to contexts' length. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.example_id = example_id + self.question = question + self.contexts = contexts + self.endings = endings + self.label = label + + +class InputFeatures(object): + def __init__(self, + example_id, + choices_features, + label + + ): + self.example_id = example_id + self.choices_features = [ + { + 'input_ids': input_ids, + 'input_mask': input_mask, + 'segment_ids': segment_ids + } + for _, input_ids, input_mask, segment_ids in choices_features + ] + self.label = label + + +class DataProcessor(object): + """Base class for data converters for multiple choice data sets.""" + + def get_train_examples(self, data_dir): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_test_examples(self, data_dir): + """Gets a collection of `InputExample`s for the test set.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + +class RaceProcessor(DataProcessor): + """Processor for the RACE data set.""" + + def get_train_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} train".format(data_dir)) + high = os.path.join(data_dir, 'train/high') + middle = os.path.join(data_dir, 'train/middle') + high = self._read_txt(high) + middle = self._read_txt(middle) + return self._create_examples(high + middle, 'train') + + def get_dev_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} dev".format(data_dir)) + high = os.path.join(data_dir, 'dev/high') + middle = os.path.join(data_dir, 'dev/middle') + high = self._read_txt(high) + middle = self._read_txt(middle) + return self._create_examples(high + middle, 'dev') + + def get_test_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} test".format(data_dir)) + high = os.path.join(data_dir, 'test/high') + middle = os.path.join(data_dir, 'test/middle') + high = self._read_txt(high) + middle = self._read_txt(middle) + return 
self._create_examples(high + middle, 'test') + + def get_labels(self): + """See base class.""" + return ["0", "1", "2", "3"] + + def _read_txt(self, input_dir): + lines = [] + files = glob.glob(input_dir + "/*txt") + for file in tqdm.tqdm(files, desc="read files"): + with open(file, 'r', encoding='utf-8') as fin: + data_raw = json.load(fin) + data_raw["race_id"] = file + lines.append(data_raw) + return lines + + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (_, data_raw) in enumerate(lines): + race_id = "%s-%s" % (set_type, data_raw["race_id"]) + article = data_raw["article"] + for i in range(len(data_raw["answers"])): + truth = str(ord(data_raw['answers'][i]) - ord('A')) + question = data_raw['questions'][i] + options = data_raw['options'][i] + + examples.append( + InputExample( + example_id=race_id, + question=question, + contexts=[article, article, article, article], # this is not efficient but convenient + endings=[options[0], options[1], options[2], options[3]], + label=truth)) + return examples + +class SwagProcessor(DataProcessor): + """Processor for the SWAG data set.""" + + def get_train_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} train".format(data_dir)) + return self._create_examples(self._read_csv(os.path.join(data_dir, "train.csv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} dev".format(data_dir)) + return self._create_examples(self._read_csv(os.path.join(data_dir, "val.csv")), "dev") + + def get_test_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} dev".format(data_dir)) + raise ValueError( + "For swag testing, the input file does not contain a label column. It can not be tested in current code" + "setting!" + ) + return self._create_examples(self._read_csv(os.path.join(data_dir, "test.csv")), "test") + def get_labels(self): + """See base class.""" + return ["0", "1", "2", "3"] + + def _read_csv(self, input_file): + with open(input_file, 'r', encoding='utf-8') as f: + reader = csv.reader(f) + lines = [] + for line in reader: + if sys.version_info[0] == 2: + line = list(unicode(cell, 'utf-8') for cell in line) + lines.append(line) + return lines + + + def _create_examples(self, lines, type): + """Creates examples for the training and dev sets.""" + if type == "train" and lines[0][-1] != 'label': + raise ValueError( + "For training, the input file must contain a label column." + ) + + examples = [ + InputExample( + example_id=line[2], + question=line[5], # in the swag dataset, the + # common beginning of each + # choice is stored in "sent2". 
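+                # The remaining indices follow the SWAG csv layout this processor
+                # assumes: line[4] holds the shared context (sent1), line[7]-line[10]
+                # the four candidate endings, and line[11] the gold label.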
+ contexts = [line[4], line[4], line[4], line[4]], + endings = [line[7], line[8], line[9], line[10]], + label=line[11] + ) for line in lines[1:] # we skip the line with the column names + ] + + return examples + + +class ArcProcessor(DataProcessor): + """Processor for the ARC data set (request from allennlp).""" + + def get_train_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} train".format(data_dir)) + return self._create_examples(self._read_json(os.path.join(data_dir, "train.jsonl")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + logger.info("LOOKING AT {} dev".format(data_dir)) + return self._create_examples(self._read_json(os.path.join(data_dir, "dev.jsonl")), "dev") + + def get_test_examples(self, data_dir): + logger.info("LOOKING AT {} test".format(data_dir)) + return self._create_examples(self._read_json(os.path.join(data_dir, "test.jsonl")), "test") + + def get_labels(self): + """See base class.""" + return ["0", "1", "2", "3"] + + def _read_json(self, input_file): + with open(input_file, 'r', encoding='utf-8') as fin: + lines = fin.readlines() + return lines + + + def _create_examples(self, lines, type): + """Creates examples for the training and dev sets.""" + + #There are two types of labels. They should be normalized + def normalize(truth): + if truth in "ABCD": + return ord(truth) - ord("A") + elif truth in "1234": + return int(truth) - 1 + else: + logger.info("truth ERROR! %s", str(truth)) + return None + + examples = [] + three_choice = 0 + four_choice = 0 + five_choice = 0 + other_choices = 0 + # we deleted example which has more than or less than four choices + for line in tqdm.tqdm(lines, desc="read arc data"): + data_raw = json.loads(line.strip("\n")) + if len(data_raw["question"]["choices"]) == 3: + three_choice += 1 + continue + elif len(data_raw["question"]["choices"]) == 5: + five_choice += 1 + continue + elif len(data_raw["question"]["choices"]) != 4: + other_choices += 1 + continue + four_choice += 1 + truth = str(normalize(data_raw["answerKey"])) + assert truth != "None" + question_choices = data_raw["question"] + question = question_choices["stem"] + id = data_raw["id"] + options = question_choices["choices"] + if len(options) == 4: + examples.append( + InputExample( + example_id = id, + question=question, + contexts=[options[0]["para"].replace("_", ""), options[1]["para"].replace("_", ""), + options[2]["para"].replace("_", ""), options[3]["para"].replace("_", "")], + endings=[options[0]["text"], options[1]["text"], options[2]["text"], options[3]["text"]], + label=truth)) + + if type == "train": + assert len(examples) > 1 + assert examples[0].label is not None + logger.info("len examples: %s}", str(len(examples))) + logger.info("Three choices: %s", str(three_choice)) + logger.info("Five choices: %s", str(five_choice)) + logger.info("Other choices: %s", str(other_choices)) + logger.info("four choices: %s", str(four_choice)) + + return examples + + +def convert_examples_to_features(examples, label_list, max_seq_length, + tokenizer, + cls_token_at_end=False, + cls_token='[CLS]', + cls_token_segment_id=1, + sep_token='[SEP]', + sequence_a_segment_id=0, + sequence_b_segment_id=1, + sep_token_extra=False, + pad_token_segment_id=0, + pad_on_left=False, + pad_token=0, + mask_padding_with_zero=True): + """ Loads a data file into a list of `InputBatch`s + `cls_token_at_end` define the location of the CLS token: + - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP] + - True (XLNet/GPT pattern): A + 
[SEP] + B + [SEP] + [CLS] + `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet) + """ + + label_map = {label : i for i, label in enumerate(label_list)} + + features = [] + for (ex_index, example) in tqdm.tqdm(enumerate(examples), desc="convert examples to features"): + if ex_index % 10000 == 0: + logger.info("Writing example %d of %d" % (ex_index, len(examples))) + choices_features = [] + for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)): + tokens_a = tokenizer.tokenize(context) + tokens_b = None + if example.question.find("_") != -1: + #this is for cloze question + tokens_b = tokenizer.tokenize(example.question.replace("_", ending)) + else: + tokens_b = tokenizer.tokenize(example.question + " " + ending) + # you can add seq token between quesiotn and ending. This does not make too much difference. + # tokens_b = tokenizer.tokenize(example.question) + # tokens_b += [sep_token] + # if sep_token_extra: + # tokens_b += [sep_token] + # tokens_b += tokenizer.tokenize(ending) + + special_tokens_count = 4 if sep_token_extra else 3 + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count) + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = tokens_a + [sep_token] + if sep_token_extra: + # roberta uses an extra separator b/w pairs of sentences + tokens += [sep_token] + + segment_ids = [sequence_a_segment_id] * len(tokens) + + if tokens_b: + tokens += tokens_b + [sep_token] + segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1) + + if cls_token_at_end: + tokens = tokens + [cls_token] + segment_ids = segment_ids + [cls_token_segment_id] + else: + tokens = [cls_token] + tokens + segment_ids = [cls_token_segment_id] + segment_ids + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) + + # Zero-pad up to the sequence length. 
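+            # pad_on_left prepends the padding (XLNet convention); otherwise it is
+            # appended (BERT convention). mask_padding_with_zero=True marks padded
+            # positions with 0 in the attention mask.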
+ padding_length = max_seq_length - len(input_ids) + if pad_on_left: + input_ids = ([pad_token] * padding_length) + input_ids + input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask + segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids + else: + input_ids = input_ids + ([pad_token] * padding_length) + input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length) + segment_ids = segment_ids + ([pad_token_segment_id] * padding_length) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + choices_features.append((tokens, input_ids, input_mask, segment_ids)) + label = label_map[example.label] + + if ex_index < 2: + logger.info("*** Example ***") + logger.info("race_id: {}".format(example.example_id)) + for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features): + logger.info("choice: {}".format(choice_idx)) + logger.info("tokens: {}".format(' '.join(tokens))) + logger.info("input_ids: {}".format(' '.join(map(str, input_ids)))) + logger.info("input_mask: {}".format(' '.join(map(str, input_mask)))) + logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids)))) + logger.info("label: {}".format(label)) + + features.append( + InputFeatures( + example_id = example.example_id, + choices_features = choices_features, + label = label + ) + ) + + return features + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. + + # However, since we'd better not to remove tokens of options and questions, you can choose to use a bigger + # length or only pop from context + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + logger.info('Attention! you are removing from token_b (swag task is ok). ' + 'If you are training ARC and RACE (you are poping question + options), ' + 'you need to try to use a bigger max seq length!') + tokens_b.pop() + + +processors = { + "race": RaceProcessor, + "swag": SwagProcessor, + "arc": ArcProcessor +} + + +GLUE_TASKS_NUM_LABELS = { + "race", 4, + "swag", 4, + "arc", 4 +} diff --git a/Optimus/code/examples/utils_squad.py b/Optimus/code/examples/utils_squad.py new file mode 100755 index 0000000000000000000000000000000000000000..34a0c9cc02b04862821ec066f5b67571a2a86681 --- /dev/null +++ b/Optimus/code/examples/utils_squad.py @@ -0,0 +1,996 @@ + +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" Load SQuAD dataset. """ + +from __future__ import absolute_import, division, print_function + +import json +import logging +import math +import collections +from io import open + +from pytorch_transformers.tokenization_bert import BasicTokenizer, whitespace_tokenize + +# Required by XLNet evaluation method to compute optimal threshold (see write_predictions_extended() method) +from utils_squad_evaluate import find_all_best_thresh_v2, make_qid_to_has_ans, get_raw_scores + +logger = logging.getLogger(__name__) + + +class SquadExample(object): + """ + A single training/test example for the Squad dataset. + For examples without an answer, the start and end position are -1. + """ + + def __init__(self, + qas_id, + question_text, + doc_tokens, + orig_answer_text=None, + start_position=None, + end_position=None, + is_impossible=None): + self.qas_id = qas_id + self.question_text = question_text + self.doc_tokens = doc_tokens + self.orig_answer_text = orig_answer_text + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + def __str__(self): + return self.__repr__() + + def __repr__(self): + s = "" + s += "qas_id: %s" % (self.qas_id) + s += ", question_text: %s" % ( + self.question_text) + s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens)) + if self.start_position: + s += ", start_position: %d" % (self.start_position) + if self.end_position: + s += ", end_position: %d" % (self.end_position) + if self.is_impossible: + s += ", is_impossible: %r" % (self.is_impossible) + return s + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + unique_id, + example_index, + doc_span_index, + tokens, + token_to_orig_map, + token_is_max_context, + input_ids, + input_mask, + segment_ids, + cls_index, + p_mask, + paragraph_len, + start_position=None, + end_position=None, + is_impossible=None): + self.unique_id = unique_id + self.example_index = example_index + self.doc_span_index = doc_span_index + self.tokens = tokens + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.cls_index = cls_index + self.p_mask = p_mask + self.paragraph_len = paragraph_len + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + +def read_squad_examples(input_file, is_training, version_2_with_negative): + """Read a SQuAD json file into a list of SquadExample.""" + with open(input_file, "r", encoding='utf-8') as reader: + input_data = json.load(reader)["data"] + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + examples = [] + for entry in input_data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + for qa in paragraph["qas"]: + qas_id = qa["id"] + question_text = qa["question"] + start_position = None + end_position = None + orig_answer_text = None + is_impossible = False + if is_training: + if 
version_2_with_negative: + is_impossible = qa["is_impossible"] + if (len(qa["answers"]) != 1) and (not is_impossible): + raise ValueError( + "For training, each question should have exactly 1 answer.") + if not is_impossible: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + answer_length - 1] + # Only add answers where the text can be exactly recovered from the + # document. If this CAN'T happen it's likely due to weird Unicode + # stuff so we will just skip the example. + # + # Note that this means for training mode, every example is NOT + # guaranteed to be preserved. + actual_text = " ".join(doc_tokens[start_position:(end_position + 1)]) + cleaned_answer_text = " ".join( + whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + logger.warning("Could not find answer: '%s' vs. '%s'", + actual_text, cleaned_answer_text) + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + + example = SquadExample( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + examples.append(example) + return examples + + +def convert_examples_to_features(examples, tokenizer, max_seq_length, + doc_stride, max_query_length, is_training, + cls_token_at_end=False, + cls_token='[CLS]', sep_token='[SEP]', pad_token=0, + sequence_a_segment_id=0, sequence_b_segment_id=1, + cls_token_segment_id=0, pad_token_segment_id=0, + mask_padding_with_zero=True): + """Loads a data file into a list of `InputBatch`s.""" + + unique_id = 1000000000 + # cnt_pos, cnt_neg = 0, 0 + # max_N, max_M = 1024, 1024 + # f = np.zeros((max_N, max_M), dtype=np.float32) + + features = [] + for (example_index, example) in enumerate(examples): + + # if example_index % 100 == 0: + # logger.info('Converting %s/%s pos %s neg %s', example_index, len(examples), cnt_pos, cnt_neg) + + query_tokens = tokenizer.tokenize(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training and example.is_impossible: + tok_start_position = -1 + tok_end_position = -1 + if is_training and not example.is_impossible: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, + example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. 
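+        # (For example, with max_tokens_for_doc=384 and doc_stride=128, a paragraph
+        # of 500 sub-tokens is covered by two overlapping windows starting at
+        # tokens 0 and 128.)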
+ # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + + # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer) + # Original TF implem also keep the classification token (set to 0) (not sure why...) + p_mask = [] + + # CLS token at the beginning + if not cls_token_at_end: + tokens.append(cls_token) + segment_ids.append(cls_token_segment_id) + p_mask.append(0) + cls_index = 0 + + # Query + for token in query_tokens: + tokens.append(token) + segment_ids.append(sequence_a_segment_id) + p_mask.append(1) + + # SEP token + tokens.append(sep_token) + segment_ids.append(sequence_a_segment_id) + p_mask.append(1) + + # Paragraph + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index] + + is_max_context = _check_is_max_context(doc_spans, doc_span_index, + split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(sequence_b_segment_id) + p_mask.append(0) + paragraph_len = doc_span.length + + # SEP token + tokens.append(sep_token) + segment_ids.append(sequence_b_segment_id) + p_mask.append(1) + + # CLS token at the end + if cls_token_at_end: + tokens.append(cls_token) + segment_ids.append(cls_token_segment_id) + p_mask.append(0) + cls_index = len(tokens) - 1 # Index of classification token + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) + + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(pad_token) + input_mask.append(0 if mask_padding_with_zero else 1) + segment_ids.append(pad_token_segment_id) + p_mask.append(1) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + span_is_impossible = example.is_impossible + start_position = None + end_position = None + if is_training and not span_is_impossible: + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. 
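+                    # ("Throw it out" is implemented as relabelling: out-of-span
+                    # chunks get span_is_impossible=True and, further below, both
+                    # positions are pointed at the [CLS] token.)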
+ doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and + tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + span_is_impossible = True + else: + doc_offset = len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + if is_training and span_is_impossible: + start_position = cls_index + end_position = cls_index + + if example_index < 20: + logger.info("*** Example ***") + logger.info("unique_id: %s" % (unique_id)) + logger.info("example_index: %s" % (example_index)) + logger.info("doc_span_index: %s" % (doc_span_index)) + logger.info("tokens: %s" % " ".join(tokens)) + logger.info("token_to_orig_map: %s" % " ".join([ + "%d:%d" % (x, y) for (x, y) in token_to_orig_map.items()])) + logger.info("token_is_max_context: %s" % " ".join([ + "%d:%s" % (x, y) for (x, y) in token_is_max_context.items() + ])) + logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) + logger.info( + "input_mask: %s" % " ".join([str(x) for x in input_mask])) + logger.info( + "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) + if is_training and span_is_impossible: + logger.info("impossible example") + if is_training and not span_is_impossible: + answer_text = " ".join(tokens[start_position:(end_position + 1)]) + logger.info("start_position: %d" % (start_position)) + logger.info("end_position: %d" % (end_position)) + logger.info( + "answer: %s" % (answer_text)) + + features.append( + InputFeatures( + unique_id=unique_id, + example_index=example_index, + doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + cls_index=cls_index, + p_mask=p_mask, + paragraph_len=paragraph_len, + start_position=start_position, + end_position=end_position, + is_impossible=span_is_impossible)) + unique_id += 1 + + return features + + +def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, + orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + # The SQuAD annotations are character based. We first project them to + # whitespace-tokenized words. But then after WordPiece tokenization, we can + # often find a "better match". For example: + # + # Question: What year was John Smith born? + # Context: The leader was John Smith (1895-1943). + # Answer: 1895 + # + # The original whitespace-tokenized answer will be "(1895-1943).". However + # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match + # the exact answer, 1895. + # + # However, this is not always possible. Consider the following: + # + # Question: What country is the top exporter of electornics? + # Context: The Japanese electronics industry is the lagest in the world. + # Answer: Japan + # + # In this case, the annotator chose "Japan" as a character sub-span of + # the word "Japanese". Since our WordPiece tokenizer does not split + # "Japanese", we just use "Japanese" as the annotation. This is fairly rare + # in SQuAD, but does happen. 
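+    # The search below scans candidate sub-spans (earliest start first, and for
+    # each start the longest end first) and returns the first whose joined tokens
+    # equal the re-tokenized answer text; if none matches, the original
+    # (input_start, input_end) span is returned unchanged.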
+ tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + +def _check_is_max_context(doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. + # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + +RawResult = collections.namedtuple("RawResult", + ["unique_id", "start_logits", "end_logits"]) + +def write_predictions(all_examples, all_features, all_results, n_best_size, + max_answer_length, do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, verbose_logging, + version_2_with_negative, null_score_diff_threshold): + """Write final predictions to the json file and log-odds of null if needed.""" + logger.info("Writing predictions to: %s" % (output_prediction_file)) + logger.info("Writing nbest to: %s" % (output_nbest_file)) + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", + ["feature_index", "start_index", "end_index", "start_logit", "end_logit"]) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + min_null_feature_index = 0 # the paragraph slice with min null score + null_start_logit = 0 # the start logit at the slice with min null score + null_end_logit = 0 # the end logit at the slice with min null score + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.unique_id] + start_indexes = _get_best_indexes(result.start_logits, 
n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + # if we could have irrelevant answers, get the min score of irrelevant + if version_2_with_negative: + feature_null_score = result.start_logits[0] + result.end_logits[0] + if feature_null_score < score_null: + score_null = feature_null_score + min_null_feature_index = feature_index + null_start_logit = result.start_logits[0] + null_end_logit = result.end_logits[0] + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. + if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index])) + if version_2_with_negative: + prelim_predictions.append( + _PrelimPrediction( + feature_index=min_null_feature_index, + start_index=0, + end_index=0, + start_logit=null_start_logit, + end_logit=null_end_logit)) + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_logit + x.end_logit), + reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"]) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction( + text=final_text, + start_logit=pred.start_logit, + end_logit=pred.end_logit)) + # if we didn't include the empty option in the n-best, include it + if version_2_with_negative: + if "" not in seen_predictions: + nbest.append( + _NbestPrediction( + text="", + start_logit=null_start_logit, + end_logit=null_end_logit)) + + # In very rare edge cases we could only have single null prediction. + # So we just create a nonce prediction in this case to avoid failure. + if len(nbest)==1: + nbest.insert(0, + _NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. 
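+        # (The "empty" placeholder guarantees nbest is non-empty, so the assertion
+        # and the softmax over summed logits below always have at least one entry.)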
+ if not nbest: + nbest.append( + _NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + assert len(nbest) >= 1 + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + if not best_non_null_entry: + if entry.text: + best_non_null_entry = entry + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + if not version_2_with_negative: + all_predictions[example.qas_id] = nbest_json[0]["text"] + else: + # predict "" iff the null score - the score of best non-null > threshold + score_diff = score_null - best_non_null_entry.start_logit - ( + best_non_null_entry.end_logit) + scores_diff_json[example.qas_id] = score_diff + if score_diff > null_score_diff_threshold: + all_predictions[example.qas_id] = "" + else: + all_predictions[example.qas_id] = best_non_null_entry.text + all_nbest_json[example.qas_id] = nbest_json + + with open(output_prediction_file, "w") as writer: + writer.write(json.dumps(all_predictions, indent=4) + "\n") + + with open(output_nbest_file, "w") as writer: + writer.write(json.dumps(all_nbest_json, indent=4) + "\n") + + if version_2_with_negative: + with open(output_null_log_odds_file, "w") as writer: + writer.write(json.dumps(scores_diff_json, indent=4) + "\n") + + return all_predictions + + +# For XLNet (and XLM which uses the same head) +RawResultExtended = collections.namedtuple("RawResultExtended", + ["unique_id", "start_top_log_probs", "start_top_index", + "end_top_log_probs", "end_top_index", "cls_logits"]) + + +def write_predictions_extended(all_examples, all_features, all_results, n_best_size, + max_answer_length, output_prediction_file, + output_nbest_file, + output_null_log_odds_file, orig_data_file, + start_n_top, end_n_top, version_2_with_negative, + tokenizer, verbose_logging): + """ XLNet write prediction logic (more complex than Bert's). + Write final predictions to the json file and log-odds of null if needed. 
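+        Unlike the BERT-style write_predictions above, no null-score threshold is
+        applied here: the best non-null prediction is always emitted, and the
+        optimal threshold is searched for afterwards via find_all_best_thresh_v2.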
+ + Requires utils_squad_evaluate.py + """ + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", + ["feature_index", "start_index", "end_index", + "start_log_prob", "end_log_prob"]) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_log_prob", "end_log_prob"]) + + logger.info("Writing predictions to: %s", output_prediction_file) + # logger.info("Writing nbest to: %s" % (output_nbest_file)) + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.unique_id] + + cur_null_score = result.cls_logits + + # if we could have irrelevant answers, get the min score of irrelevant + score_null = min(score_null, cur_null_score) + + for i in range(start_n_top): + for j in range(end_n_top): + start_log_prob = result.start_top_log_probs[i] + start_index = result.start_top_index[i] + + j_index = i * end_n_top + j + + end_log_prob = result.end_top_log_probs[j_index] + end_index = result.end_top_index[j_index] + + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. + if start_index >= feature.paragraph_len - 1: + continue + if end_index >= feature.paragraph_len - 1: + continue + + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_log_prob=start_log_prob, + end_log_prob=end_log_prob)) + + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_log_prob + x.end_log_prob), + reverse=True) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + + # XLNet un-tokenizer + # Let's keep it simple for now and see if we need all this later. 
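+            # (The commented-out block below is the XLNet-native character mapping;
+            # this implementation instead falls back to the BERT-style
+            # detokenization that follows it.)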
+ # + # tok_start_to_orig_index = feature.tok_start_to_orig_index + # tok_end_to_orig_index = feature.tok_end_to_orig_index + # start_orig_pos = tok_start_to_orig_index[pred.start_index] + # end_orig_pos = tok_end_to_orig_index[pred.end_index] + # paragraph_text = example.paragraph_text + # final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip() + + # Previously used Bert untokenizer + tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + 1)] + tok_text = tokenizer.convert_tokens_to_string(tok_tokens) + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, tokenizer.do_lower_case, + verbose_logging) + + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction( + text=final_text, + start_log_prob=pred.start_log_prob, + end_log_prob=pred.end_log_prob)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append( + _NbestPrediction(text="", start_log_prob=-1e6, + end_log_prob=-1e6)) + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_log_prob + entry.end_log_prob) + if not best_non_null_entry: + best_non_null_entry = entry + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_log_prob"] = entry.start_log_prob + output["end_log_prob"] = entry.end_log_prob + nbest_json.append(output) + + assert len(nbest_json) >= 1 + assert best_non_null_entry is not None + + score_diff = score_null + scores_diff_json[example.qas_id] = score_diff + # note(zhiliny): always predict best_non_null_entry + # and the evaluation script will search for the best threshold + all_predictions[example.qas_id] = best_non_null_entry.text + + all_nbest_json[example.qas_id] = nbest_json + + with open(output_prediction_file, "w") as writer: + writer.write(json.dumps(all_predictions, indent=4) + "\n") + + with open(output_nbest_file, "w") as writer: + writer.write(json.dumps(all_nbest_json, indent=4) + "\n") + + if version_2_with_negative: + with open(output_null_log_odds_file, "w") as writer: + writer.write(json.dumps(scores_diff_json, indent=4) + "\n") + + with open(orig_data_file, "r", encoding='utf-8') as reader: + orig_data = json.load(reader)["data"] + + qid_to_has_ans = make_qid_to_has_ans(orig_data) + has_ans_qids = [k for k, v in qid_to_has_ans.items() if v] + no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v] + exact_raw, f1_raw = get_raw_scores(orig_data, all_predictions) + out_eval = {} + + find_all_best_thresh_v2(out_eval, all_predictions, exact_raw, f1_raw, scores_diff_json, qid_to_has_ans) + + return out_eval + + +def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False): + """Project the tokenized prediction back to the original text.""" + + # When we created the data, we kept track of the alignment between original + # (whitespace tokenized) tokens and our WordPiece tokenized tokens. 
So + # now `orig_text` contains the span of our original text corresponding to the + # span that we predicted. + # + # However, `orig_text` may contain extra characters that we don't want in + # our prediction. + # + # For example, let's say: + # pred_text = steve smith + # orig_text = Steve Smith's + # + # We don't want to return `orig_text` because it contains the extra "'s". + # + # We don't want to return `pred_text` because it's already been normalized + # (the SQuAD eval script also does punctuation stripping/lower casing but + # our tokenizer does additional normalization like stripping accent + # characters). + # + # What we really want to return is "Steve Smith". + # + # Therefore, we have to apply a semi-complicated alignment heuristic between + # `pred_text` and `orig_text` to get a character-to-character alignment. This + # can fail in certain cases in which case we just return `orig_text`. + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + # We first tokenize `orig_text`, strip whitespace from the result + # and `pred_text`, and check if they are the same length. If they are + # NOT the same length, the heuristic has failed. If they are the same + # length, we assume the characters are one-to-one aligned. + tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + + tok_text = " ".join(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + if verbose_logging: + logger.info( + "Unable to find text: '%s' in '%s'" % (pred_text, orig_text)) + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + if verbose_logging: + logger.info("Length not equal after stripping spaces: '%s' vs '%s'", + orig_ns_text, tok_ns_text) + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. 
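+    # Example (continuing the case above): pred_text = "steve smith" is located in the
+    # re-tokenized orig_text; its start/end character positions are then mapped through
+    # the whitespace-stripped views built by _strip_spaces back to positions in
+    # orig_text, so the returned span is "Steve Smith" rather than "Steve Smith's".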
+ tok_s_to_ns_map = {} + for (i, tok_index) in tok_ns_to_s_map.items(): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + if verbose_logging: + logger.info("Couldn't map start position") + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + if verbose_logging: + logger.info("Couldn't map end position") + return orig_text + + output_text = orig_text[orig_start_position:(orig_end_position + 1)] + return output_text + + +def _get_best_indexes(logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + +def _compute_softmax(scores): + """Compute softmax probability over raw logits.""" + if not scores: + return [] + + max_score = None + for score in scores: + if max_score is None or score > max_score: + max_score = score + + exp_scores = [] + total_sum = 0.0 + for score in scores: + x = math.exp(score - max_score) + exp_scores.append(x) + total_sum += x + + probs = [] + for score in exp_scores: + probs.append(score / total_sum) + return probs diff --git a/Optimus/code/examples/utils_squad_evaluate.py b/Optimus/code/examples/utils_squad_evaluate.py new file mode 100755 index 0000000000000000000000000000000000000000..ed162e6fe600e7c7e642bc001aaf1dde2b9620b0 --- /dev/null +++ b/Optimus/code/examples/utils_squad_evaluate.py @@ -0,0 +1,330 @@ +""" Official evaluation script for SQuAD version 2.0. + Modified by XLNet authors to update `find_best_threshold` scripts for SQuAD V2.0 + +In addition to basic functionality, we also compute additional statistics and +plot precision-recall curves if an additional na_prob.json file is provided. +This file is expected to map question ID's to the model's predicted probability +that a question is unanswerable. 
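+Each entry maps a question id to a float (a probability in [0, 1]), e.g.
+(illustrative only): {"question-id-1": 0.91, "question-id-2": 0.07}.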
+""" +import argparse +import collections +import json +import numpy as np +import os +import re +import string +import sys + +class EVAL_OPTS(): + def __init__(self, data_file, pred_file, out_file="", + na_prob_file="na_prob.json", na_prob_thresh=1.0, + out_image_dir=None, verbose=False): + self.data_file = data_file + self.pred_file = pred_file + self.out_file = out_file + self.na_prob_file = na_prob_file + self.na_prob_thresh = na_prob_thresh + self.out_image_dir = out_image_dir + self.verbose = verbose + +OPTS = None + +def parse_args(): + parser = argparse.ArgumentParser('Official evaluation script for SQuAD version 2.0.') + parser.add_argument('data_file', metavar='data.json', help='Input data JSON file.') + parser.add_argument('pred_file', metavar='pred.json', help='Model predictions.') + parser.add_argument('--out-file', '-o', metavar='eval.json', + help='Write accuracy metrics to file (default is stdout).') + parser.add_argument('--na-prob-file', '-n', metavar='na_prob.json', + help='Model estimates of probability of no answer.') + parser.add_argument('--na-prob-thresh', '-t', type=float, default=1.0, + help='Predict "" if no-answer probability exceeds this (default = 1.0).') + parser.add_argument('--out-image-dir', '-p', metavar='out_images', default=None, + help='Save precision-recall curves to directory.') + parser.add_argument('--verbose', '-v', action='store_true') + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + return parser.parse_args() + +def make_qid_to_has_ans(dataset): + qid_to_has_ans = {} + for article in dataset: + for p in article['paragraphs']: + for qa in p['qas']: + qid_to_has_ans[qa['id']] = bool(qa['answers']) + return qid_to_has_ans + +def normalize_answer(s): + """Lower text and remove punctuation, articles and extra whitespace.""" + def remove_articles(text): + regex = re.compile(r'\b(a|an|the)\b', re.UNICODE) + return re.sub(regex, ' ', text) + def white_space_fix(text): + return ' '.join(text.split()) + def remove_punc(text): + exclude = set(string.punctuation) + return ''.join(ch for ch in text if ch not in exclude) + def lower(text): + return text.lower() + return white_space_fix(remove_articles(remove_punc(lower(s)))) + +def get_tokens(s): + if not s: return [] + return normalize_answer(s).split() + +def compute_exact(a_gold, a_pred): + return int(normalize_answer(a_gold) == normalize_answer(a_pred)) + +def compute_f1(a_gold, a_pred): + gold_toks = get_tokens(a_gold) + pred_toks = get_tokens(a_pred) + common = collections.Counter(gold_toks) & collections.Counter(pred_toks) + num_same = sum(common.values()) + if len(gold_toks) == 0 or len(pred_toks) == 0: + # If either is no-answer, then F1 is 1 if they agree, 0 otherwise + return int(gold_toks == pred_toks) + if num_same == 0: + return 0 + precision = 1.0 * num_same / len(pred_toks) + recall = 1.0 * num_same / len(gold_toks) + f1 = (2 * precision * recall) / (precision + recall) + return f1 + +def get_raw_scores(dataset, preds): + exact_scores = {} + f1_scores = {} + for article in dataset: + for p in article['paragraphs']: + for qa in p['qas']: + qid = qa['id'] + gold_answers = [a['text'] for a in qa['answers'] + if normalize_answer(a['text'])] + if not gold_answers: + # For unanswerable questions, only correct answer is empty string + gold_answers = [''] + if qid not in preds: + print('Missing prediction for %s' % qid) + continue + a_pred = preds[qid] + # Take max over all gold answers + exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers) + f1_scores[qid] = 
max(compute_f1(a, a_pred) for a in gold_answers) + return exact_scores, f1_scores + +def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh): + new_scores = {} + for qid, s in scores.items(): + pred_na = na_probs[qid] > na_prob_thresh + if pred_na: + new_scores[qid] = float(not qid_to_has_ans[qid]) + else: + new_scores[qid] = s + return new_scores + +def make_eval_dict(exact_scores, f1_scores, qid_list=None): + if not qid_list: + total = len(exact_scores) + return collections.OrderedDict([ + ('exact', 100.0 * sum(exact_scores.values()) / total), + ('f1', 100.0 * sum(f1_scores.values()) / total), + ('total', total), + ]) + else: + total = len(qid_list) + return collections.OrderedDict([ + ('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total), + ('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total), + ('total', total), + ]) + +def merge_eval(main_eval, new_eval, prefix): + for k in new_eval: + main_eval['%s_%s' % (prefix, k)] = new_eval[k] + +def plot_pr_curve(precisions, recalls, out_image, title): + plt.step(recalls, precisions, color='b', alpha=0.2, where='post') + plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b') + plt.xlabel('Recall') + plt.ylabel('Precision') + plt.xlim([0.0, 1.05]) + plt.ylim([0.0, 1.05]) + plt.title(title) + plt.savefig(out_image) + plt.clf() + +def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans, + out_image=None, title=None): + qid_list = sorted(na_probs, key=lambda k: na_probs[k]) + true_pos = 0.0 + cur_p = 1.0 + cur_r = 0.0 + precisions = [1.0] + recalls = [0.0] + avg_prec = 0.0 + for i, qid in enumerate(qid_list): + if qid_to_has_ans[qid]: + true_pos += scores[qid] + cur_p = true_pos / float(i+1) + cur_r = true_pos / float(num_true_pos) + if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i+1]]: + # i.e., if we can put a threshold after this point + avg_prec += cur_p * (cur_r - recalls[-1]) + precisions.append(cur_p) + recalls.append(cur_r) + if out_image: + plot_pr_curve(precisions, recalls, out_image, title) + return {'ap': 100.0 * avg_prec} + +def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, + qid_to_has_ans, out_image_dir): + if out_image_dir and not os.path.exists(out_image_dir): + os.makedirs(out_image_dir) + num_true_pos = sum(1 for v in qid_to_has_ans.values() if v) + if num_true_pos == 0: + return + pr_exact = make_precision_recall_eval( + exact_raw, na_probs, num_true_pos, qid_to_has_ans, + out_image=os.path.join(out_image_dir, 'pr_exact.png'), + title='Precision-Recall curve for Exact Match score') + pr_f1 = make_precision_recall_eval( + f1_raw, na_probs, num_true_pos, qid_to_has_ans, + out_image=os.path.join(out_image_dir, 'pr_f1.png'), + title='Precision-Recall curve for F1 score') + oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()} + pr_oracle = make_precision_recall_eval( + oracle_scores, na_probs, num_true_pos, qid_to_has_ans, + out_image=os.path.join(out_image_dir, 'pr_oracle.png'), + title='Oracle Precision-Recall curve (binary task of HasAns vs. 
NoAns)') + merge_eval(main_eval, pr_exact, 'pr_exact') + merge_eval(main_eval, pr_f1, 'pr_f1') + merge_eval(main_eval, pr_oracle, 'pr_oracle') + +def histogram_na_prob(na_probs, qid_list, image_dir, name): + if not qid_list: + return + x = [na_probs[k] for k in qid_list] + weights = np.ones_like(x) / float(len(x)) + plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0)) + plt.xlabel('Model probability of no-answer') + plt.ylabel('Proportion of dataset') + plt.title('Histogram of no-answer probability: %s' % name) + plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name)) + plt.clf() + +def find_best_thresh(preds, scores, na_probs, qid_to_has_ans): + num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k]) + cur_score = num_no_ans + best_score = cur_score + best_thresh = 0.0 + qid_list = sorted(na_probs, key=lambda k: na_probs[k]) + for i, qid in enumerate(qid_list): + if qid not in scores: continue + if qid_to_has_ans[qid]: + diff = scores[qid] + else: + if preds[qid]: + diff = -1 + else: + diff = 0 + cur_score += diff + if cur_score > best_score: + best_score = cur_score + best_thresh = na_probs[qid] + return 100.0 * best_score / len(scores), best_thresh + +def find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans): + num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k]) + cur_score = num_no_ans + best_score = cur_score + best_thresh = 0.0 + qid_list = sorted(na_probs, key=lambda k: na_probs[k]) + for i, qid in enumerate(qid_list): + if qid not in scores: continue + if qid_to_has_ans[qid]: + diff = scores[qid] + else: + if preds[qid]: + diff = -1 + else: + diff = 0 + cur_score += diff + if cur_score > best_score: + best_score = cur_score + best_thresh = na_probs[qid] + + has_ans_score, has_ans_cnt = 0, 0 + for qid in qid_list: + if not qid_to_has_ans[qid]: continue + has_ans_cnt += 1 + + if qid not in scores: continue + has_ans_score += scores[qid] + + return 100.0 * best_score / len(scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt + +def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans): + best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans) + best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans) + main_eval['best_exact'] = best_exact + main_eval['best_exact_thresh'] = exact_thresh + main_eval['best_f1'] = best_f1 + main_eval['best_f1_thresh'] = f1_thresh + +def find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans): + best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans) + best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans) + main_eval['best_exact'] = best_exact + main_eval['best_exact_thresh'] = exact_thresh + main_eval['best_f1'] = best_f1 + main_eval['best_f1_thresh'] = f1_thresh + main_eval['has_ans_exact'] = has_ans_exact + main_eval['has_ans_f1'] = has_ans_f1 + +def main(OPTS): + with open(OPTS.data_file) as f: + dataset_json = json.load(f) + dataset = dataset_json['data'] + with open(OPTS.pred_file) as f: + preds = json.load(f) + if OPTS.na_prob_file: + with open(OPTS.na_prob_file) as f: + na_probs = json.load(f) + else: + na_probs = {k: 0.0 for k in preds} + qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False + has_ans_qids = [k for k, v in qid_to_has_ans.items() if v] + no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v] + exact_raw, f1_raw = get_raw_scores(dataset, preds) + 
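+  # Questions whose predicted no-answer probability exceeds OPTS.na_prob_thresh are
+  # scored as abstentions: correct only when the gold answer set is empty.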
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, + OPTS.na_prob_thresh) + f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, + OPTS.na_prob_thresh) + out_eval = make_eval_dict(exact_thresh, f1_thresh) + if has_ans_qids: + has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids) + merge_eval(out_eval, has_ans_eval, 'HasAns') + if no_ans_qids: + no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids) + merge_eval(out_eval, no_ans_eval, 'NoAns') + if OPTS.na_prob_file: + find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans) + if OPTS.na_prob_file and OPTS.out_image_dir: + run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, + qid_to_has_ans, OPTS.out_image_dir) + histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, 'hasAns') + histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, 'noAns') + if OPTS.out_file: + with open(OPTS.out_file, 'w') as f: + json.dump(out_eval, f) + else: + print(json.dumps(out_eval, indent=2)) + return out_eval + +if __name__ == '__main__': + OPTS = parse_args() + if OPTS.out_image_dir: + import matplotlib + matplotlib.use('Agg') + import matplotlib.pyplot as plt + main(OPTS) diff --git a/Optimus/code/modules/__init__.py b/Optimus/code/modules/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..46f9c1042373aa646f5a4ee3eb3ea422f51f1212 --- /dev/null +++ b/Optimus/code/modules/__init__.py @@ -0,0 +1,7 @@ +from .encoders import * +from .decoders import * +from .vae import * +from .utils import * +from .spacefusion import * +from .cara import * +from .arae import * diff --git a/Optimus/code/modules/__pycache__/__init__.cpython-310.pyc b/Optimus/code/modules/__pycache__/__init__.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..1ba42d21cde16e3bd7801ac4426c0eef390e40a9 Binary files /dev/null and b/Optimus/code/modules/__pycache__/__init__.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/__init__.cpython-37.pyc b/Optimus/code/modules/__pycache__/__init__.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..6ec9b3056bc39177e6518b01b985b266ce1e5ef3 Binary files /dev/null and b/Optimus/code/modules/__pycache__/__init__.cpython-37.pyc differ diff --git a/Optimus/code/modules/__pycache__/arae.cpython-310.pyc b/Optimus/code/modules/__pycache__/arae.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..801031c274812b5cf060c94d33f13511d6567425 Binary files /dev/null and b/Optimus/code/modules/__pycache__/arae.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/arae.cpython-37.pyc b/Optimus/code/modules/__pycache__/arae.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..3389d724cd4df7547c226384572798c34b7183b7 Binary files /dev/null and b/Optimus/code/modules/__pycache__/arae.cpython-37.pyc differ diff --git a/Optimus/code/modules/__pycache__/cara.cpython-310.pyc b/Optimus/code/modules/__pycache__/cara.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..f24f369a830bca638277735746d6922ceccde9e9 Binary files /dev/null and b/Optimus/code/modules/__pycache__/cara.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/cara.cpython-37.pyc b/Optimus/code/modules/__pycache__/cara.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..3bbaf35740cd9a8ec1f085535956cf300172bc4d Binary 
files /dev/null and b/Optimus/code/modules/__pycache__/cara.cpython-37.pyc differ diff --git a/Optimus/code/modules/__pycache__/spacefusion.cpython-310.pyc b/Optimus/code/modules/__pycache__/spacefusion.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..1beef888d9a12b5a9fbd1ff0d79619f0a5e20002 Binary files /dev/null and b/Optimus/code/modules/__pycache__/spacefusion.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/spacefusion.cpython-37.pyc b/Optimus/code/modules/__pycache__/spacefusion.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..003634076fb0330e56a01f033f3a4a2cab7f29f2 Binary files /dev/null and b/Optimus/code/modules/__pycache__/spacefusion.cpython-37.pyc differ diff --git a/Optimus/code/modules/__pycache__/utils.cpython-310.pyc b/Optimus/code/modules/__pycache__/utils.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..568fbad325f4808b53c0f35ebc806c3ba3cc5c8f Binary files /dev/null and b/Optimus/code/modules/__pycache__/utils.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/utils.cpython-37.pyc b/Optimus/code/modules/__pycache__/utils.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..bf99c8a2321a501fa96f8408027dd156b6facd60 Binary files /dev/null and b/Optimus/code/modules/__pycache__/utils.cpython-37.pyc differ diff --git a/Optimus/code/modules/__pycache__/vae.cpython-310.pyc b/Optimus/code/modules/__pycache__/vae.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..7cf9d4a52c51b3191af36090bfe612af771b0d54 Binary files /dev/null and b/Optimus/code/modules/__pycache__/vae.cpython-310.pyc differ diff --git a/Optimus/code/modules/__pycache__/vae.cpython-37.pyc b/Optimus/code/modules/__pycache__/vae.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..aeaa12679289478e82000326a3392738ba04c6cd Binary files /dev/null and b/Optimus/code/modules/__pycache__/vae.cpython-37.pyc differ diff --git a/Optimus/code/modules/arae.py b/Optimus/code/modules/arae.py new file mode 100755 index 0000000000000000000000000000000000000000..cc4ee4e5f44c47e56903912f184d8be3345cf5a0 --- /dev/null +++ b/Optimus/code/modules/arae.py @@ -0,0 +1,274 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class ARAE(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(ARAE, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + 
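+        # latent_generator maps Gaussian noise to synthetic latent codes; latent_discriminator
+        # is trained to tell encoded codes from generated ones, and latent_classifier
+        # predicts the conditioning label from the latent code.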
self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data + + self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask=None): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), conditional labels. + + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + # Latent discriminator + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + # Latent classifier + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + + # Generate based on encoded z and gt labels. 
(reconstruction) + past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + + past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + # Train a classifier in the observation space + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Loss + loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + loss_cls + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. (conditional generation) + cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. + ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. 
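+            # Together, acc_ge_cls, acc_at_cls and acc_cg_cls report how often this classifier
+            # assigns the intended label to reconstructions, attribute-transferred samples,
+            # and conditionally generated samples, respectively.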
+ cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits diff --git a/Optimus/code/modules/cara.py b/Optimus/code/modules/cara.py new file mode 100755 index 0000000000000000000000000000000000000000..ef480533d32bf80310ce51b127b14a67def2a91c --- /dev/null +++ b/Optimus/code/modules/cara.py @@ -0,0 +1,374 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class CARA(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(CARA, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data + + self.conv1 = nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), conditional labels. 
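+        # Unlike ARAE above, CARA feeds the conditioning signal to the decoder by adding
+        # the label embedding directly to the latent code (past = latent_z + label_emb),
+        # and only the reconstruction term drives training: the latent-space losses are
+        # multiplied by 0.0 before being added to the total loss.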
+ + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + #################### Latent discriminator for sampling from a simple distribution #################### + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + #################### Latent classifier for disentanglement #################### + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + + #################### Recontruction loss with latent z and label emb #################### + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + past_sampled_label = sampled_label_emb + + # Generate based on encoded z and gt labels. 
(reconstruction) + # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + past_z = latent_z + # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + gen_past_z = gen_z # (B, n_blocks * hidden_size) + + # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + + past = latent_z + label_emb # (B, n_blocks * hidden_size) + + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + #################### Train a classifier in the observation space #################### + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size) + + # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer. + # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight) + # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len) + # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h) + # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1) + # if self.args.label_size <= 2: + # prob_at_soft_cls = prob_at_soft_cls.squeeze(1) + # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float()) + # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long) + # else: + # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels) + # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1) + # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float() + + # Loss + loss_latent_space = (loss_encoder + loss_lsc) + (loss_lsd + loss_lsg) + self.args.beta_cls * loss_cls # + loss_at_soft_cls + loss = loss_rec + 0.0 * loss_latent_space + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. 
(conditional generation) + # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. + ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. + cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + # 'loss_at_soft_cls': loss_at_soft_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + # 'acc_at_soft_cls': acc_at_soft_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + + # softmax sample + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = 
self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits + + def sample_sequence_conditional_batch_soft(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1) + context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size) + context_soft.scatter_(1, context, 1) # (B, vocab_size) + generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size) + + # with torch.no_grad(): + while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size) + inputs = {'soft_ids': generated_soft, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] # (B, seq_len, vocab_size) + + # Gumbel softmax sample + next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size) + generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size) + + # # softmax sample + # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + # filtered_logits = 
F.softmax(filtered_logits, dim=-1) + # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1) + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated_soft # (B, seq_len, vocab_size) + + +### Gumbel Softmax +def gumbel_softmax(logits, temperature, hard=False): + """Sample from the Gumbel-Softmax distribution and optionally discretize. + Args: + logits: [..., n_class] unnormalized log-probs + temperature: non-negative scalar + hard: if True, take argmax, but differentiate w.r.t. soft sample y + Returns: + [..., n_class] sample from the Gumbel-Softmax distribution. + If hard=True, then the returned sample will be one-hot, otherwise it will be a probabilitiy distribution that sums to 1 across classes + """ + y = gumbel_softmax_sample(logits, temperature) # (..., n_class) + + if hard: # return onehot + shape = y.size() + _, ind = y.max(dim=-1) + y_hard = torch.zeros_like(y).view(-1, shape[-1]) + y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot + y_hard = y_hard.view(*shape) + # Set gradients w.r.t. y_hard gradients w.r.t. y + y = (y_hard - y).detach() + y + + return y # (..., n_class) + +from torch.nn import functional as F +def gumbel_softmax_sample(logits, temperature): + y = logits + sample_gumbel(logits.size(), logits.device) + return F.softmax(y / temperature, dim=-1) + +def sample_gumbel(shape, device, eps=1e-20): + U = torch.rand(shape).to(device=device) + return -torch.log(-torch.log(U + eps) + eps) diff --git a/Optimus/code/modules/ctrl_gen.py b/Optimus/code/modules/ctrl_gen.py new file mode 100755 index 0000000000000000000000000000000000000000..2b828132a0d208f9aacec9d70151b8e1562cfcc1 --- /dev/null +++ b/Optimus/code/modules/ctrl_gen.py @@ -0,0 +1,371 @@ +import math +import torch +import torch.nn as nn +from .utils import log_sum_exp +import pdb +import sys +sys.path.append('../../') +from pytorch_transformers.modeling_bert import BertEmbeddings +import torch.nn.functional as F + + +class Ctrl_Gen(nn.Module): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(Ctrl_Gen, self).__init__() + self.encoder = encoder + self.decoder = decoder + self.tokenizer_encoder = tokenizer_encoder + self.tokenizer_decoder = tokenizer_decoder + + self.args = args + self.nz = args.latent_size + + self.bos_token_id_list = self.tokenizer_decoder.encode(self.tokenizer_decoder.bos_token) + self.pad_token_id = self.tokenizer_decoder.encode(self.tokenizer_decoder.pad_token)[0] + + # connector: from Bert hidden units to the latent space + self.linear = nn.Linear(encoder.config.hidden_size, self.nz, bias=False) + + # # Standard Normal prior + # loc = torch.zeros(self.nz, device=args.device) + # scale = torch.ones(self.nz, device=args.device) + # self.prior = torch.distributions.normal.Normal(loc, scale) + + self.label_embedding = nn.Embedding(args.label_size, self.nz, padding_idx=0) # use the same size as latent_z so as to use the same decoder.linear() + self.latent_generator = nn.Linear(self.nz, self.nz) + self.latent_classifier = nn.Linear(self.nz, args.label_size if args.label_size > 2 else 1) + self.latent_discriminator = nn.Linear(self.nz, 1) + + self.gpt_embeddings = nn.Embedding(self.decoder.config.vocab_size, self.decoder.config.n_embd) + self.gpt_embeddings.weight.data = decoder.transformer.wte.weight.data + + self.conv1 = 
nn.Conv1d(self.encoder.config.hidden_size, self.encoder.config.hidden_size, 3) + self.classifier = nn.Linear(self.encoder.config.hidden_size, 1 if args.label_size <= 2 else args.label_size) + + self.CrossEntropyLoss = torch.nn.CrossEntropyLoss() + self.BCEWithLogitsLoss = torch.nn.BCEWithLogitsLoss() + + def forward(self, input_seq_ids, tgt_seq_ids, cond_labels, attention_mask): + # inputs: (B, seq_len) + # labels: (B, seq_len) + # cond_labels: (B), conditional labels. + + ones_label = torch.ones_like(cond_labels).to(dtype=torch.float32) + zeros_label = torch.zeros_like(cond_labels).to(dtype=torch.float32) + random_noise = torch.nn.init.normal_(torch.empty(input_seq_ids.size(0), self.nz)).to(device=input_seq_ids.device, dtype=torch.float32) + + # Encode inputs + outputs = self.encoder(input_seq_ids, attention_mask=attention_mask) + pooled_hidden_fea = outputs[1] # (B, dim_h) + + # Encode z + latent_z = self.linear(pooled_hidden_fea) # (B, nz) + + # Generate z + gen_z = self.latent_generator(random_noise) # (B, nz) + + # Latent discriminator + prob_encode_z_dis = self.latent_discriminator(latent_z).squeeze(1).float() # (B) + prob_gen_z_dis = self.latent_discriminator(gen_z).squeeze(1).float() # (B) + # Train latent discriminator + loss_lsd = self.BCEWithLogitsLoss(prob_gen_z_dis, zeros_label) + self.BCEWithLogitsLoss(prob_encode_z_dis, ones_label) + acc_encode_z_dis = ((prob_encode_z_dis >= 0).float() == ones_label).float() + acc_gen_z_dis = ((prob_gen_z_dis >= 0).float() == zeros_label).float() + # Train sampler adversarially + loss_lsg = self.BCEWithLogitsLoss(prob_gen_z_dis, ones_label) + + # Latent classifier + prob_encode_z_cls = self.latent_classifier(latent_z) # (B, n_labels) + if self.args.label_size <= 2: + prob_encode_z_cls = prob_encode_z_cls.squeeze(1) # (B) + # Train latent classifier + loss_lsc = self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + acc_encode_z_cls = ((prob_encode_z_cls >= 0).float() == cond_labels.float()).float() + # Train encoder adversarially + loss_encoder = 1 - self.BCEWithLogitsLoss(prob_encode_z_cls, cond_labels.float()) + else: + # Train latent classifier + loss_lsc = self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + acc_encode_z_cls = (torch.argmax(prob_encode_z_cls, dim=-1) == cond_labels).float() + # Train encoder adversarially + loss_encoder = 1 - self.CrossEntropyLoss(prob_encode_z_cls, cond_labels) + + # Embed labels + label_emb = self.label_embedding(cond_labels) # (B, hidden_size) + # past_label = self.decoder.linear(label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + if self.args.label_size <= 2: + sampled_cond_labels = 1 - cond_labels + else: + raise NotImplementedError # todo: currently only implemented for binary labels. need to change for multi-class labels. + sampled_label_emb = self.label_embedding(sampled_cond_labels) # (B, hidden_size) + # past_sampled_label = self.decoder.linear(sampled_label_emb) # (B, n_blocks * hidden_size) # todo: use the same linear layer for latent_z for now. + past_sampled_label = sampled_label_emb + + # Generate based on encoded z and gt labels. 
(reconstruction) + # past_z = self.decoder.linear(latent_z) # (B, n_blocks * hidden_size) + past_z = latent_z + # gen_past_z = self.decoder.linear(gen_z) # (B, n_blocks * hidden_size) + gen_past_z = gen_z # (B, n_blocks * hidden_size) + + # past = torch.cat([past_z.unsqueeze(1), past_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + + past = latent_z + label_emb # (B, n_blocks * hidden_size) + + outputs = self.decoder(input_ids=tgt_seq_ids, past=past, labels=tgt_seq_ids, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + + # Train a classifier in the observation space + tgt_emb = self.gpt_embeddings(tgt_seq_ids) + tgt_encode = self.conv1(tgt_emb.transpose(1, 2)) # (B, dim_h, seq_len) + tgt_encode = torch.mean(tgt_encode, dim=-1) # (B, dim_h) + prob_cls = self.classifier(tgt_encode) # (B, n_labels) + if self.args.label_size <= 2: + prob_cls = prob_cls.squeeze(1) + loss_cls = self.BCEWithLogitsLoss(prob_cls, cond_labels.float()) + pred_cls = (prob_cls >= 0).to(dtype=torch.long) + else: + loss_cls = self.CrossEntropyLoss(prob_cls, cond_labels) + pred_cls = torch.argmax(prob_cls, dim=-1) + acc_cls = (pred_cls == cond_labels).float() + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + # at_generated_soft = self.sample_sequence_conditional_batch_soft(past=at_past, context=self.bos_token_id_list) # (B, seq_len, vocab_size) + + # # Classifier on attribute transfer generated sentences. Train Generator on attribute transfer. + # at_soft_emb = torch.matmul(at_generated_soft, self.gpt_embeddings.weight) + # at_soft_encode = self.conv1(at_soft_emb.transpose(1, 2)) # (B, dim_h, seq_len) + # at_soft_encode = torch.mean(at_soft_encode, dim=-1) # (B, dim_h) + # prob_at_soft_cls = self.classifier(at_soft_encode) # (B, 1) + # if self.args.label_size <= 2: + # prob_at_soft_cls = prob_at_soft_cls.squeeze(1) + # loss_at_soft_cls = self.BCEWithLogitsLoss(prob_at_soft_cls, sampled_cond_labels.float()) + # pred_at_soft_cls = (prob_at_soft_cls >= 0).to(torch.long) + # else: + # loss_at_soft_cls = self.CrossEntropyLoss(prob_at_soft_cls, sampled_cond_labels) + # pred_at_soft_cls = torch.argmax(prob_at_soft_cls, dim=-1) + # acc_at_soft_cls = (pred_at_soft_cls == sampled_cond_labels).float() + + # Loss + loss = loss_rec + loss_encoder + loss_lsc + loss_lsd + loss_lsg + self.args.beta_cls * loss_cls # + loss_at_soft_cls + + if not self.training: + # Generate based on encoded z and gt labels + generated = self.sample_sequence_conditional_batch(past=past, context=self.bos_token_id_list) + + # Generate based on encoded z and sampled labels (attribute transfer) + # at_past = torch.cat([past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + at_past = past_z + past_sampled_label # (B, n_blocks * hidden_size) + at_generated = self.sample_sequence_conditional_batch(past=at_past, context=self.bos_token_id_list) # (B, seq_len) + + # Generate based on sampled z and sampled labels. (conditional generation) + # cg_past = torch.cat([gen_past_z.unsqueeze(1), past_sampled_label.unsqueeze(1)], dim=1) # (B, 2, n_blocks * hidden_size) + cg_past = gen_past_z + past_sampled_label # (B, n_blocks * hidden_size) + cg_generated = self.sample_sequence_conditional_batch(past=cg_past, context=self.bos_token_id_list) # (B, seq_len) + + # classifier on gt generated sentences. 
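+ # The same observation-space classifier (gpt_embeddings -> conv1 -> mean-pool -> classifier) is applied below to
+ # the reconstruction, attribute-transfer and conditional generations, so each can be scored against its target label.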
+ ge_emb = self.gpt_embeddings(generated) + ge_encode = self.conv1(ge_emb.transpose(1, 2)) # (B, dim_h, seq_len) + ge_encode = torch.mean(ge_encode, dim=-1) # (B, dim_h) + prob_ge_cls = self.classifier(ge_encode) # (B, 1) + + if self.args.label_size <= 2: + pred_ge_cls = (prob_ge_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_ge_cls = torch.argmax(prob_ge_cls, dim=-1) + acc_ge_cls = (pred_ge_cls == cond_labels).float() + + # classifier on attribute transfer generated sentences. + at_emb = self.gpt_embeddings(at_generated) + at_encode = self.conv1(at_emb.transpose(1, 2)) # (B, dim_h, seq_len) + at_encode = torch.mean(at_encode, dim=-1) # (B, dim_h) + prob_at_cls = self.classifier(at_encode) # (B, 1) + if self.args.label_size <= 2: + pred_at_cls = (prob_at_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_at_cls = torch.argmax(prob_at_cls, dim=-1) + acc_at_cls = (pred_at_cls == sampled_cond_labels).float() + + # classifier on conditional generated sentences. + cg_emb = self.gpt_embeddings(cg_generated) + cg_encode = self.conv1(cg_emb.transpose(1, 2)) # (B, dim_h, seq_len) + cg_encode = torch.mean(cg_encode, dim=-1) # (B, dim_h) + prob_cg_cls = self.classifier(cg_encode) # (B, 1) + if self.args.label_size <= 2: + pred_cg_cls = (prob_cg_cls.squeeze(1) >= 0).to(torch.long) + else: + pred_cg_cls = torch.argmax(prob_cg_cls, dim=-1) + acc_cg_cls = (pred_cg_cls == sampled_cond_labels).float() + + result = { + 'sampled_cond_labels': sampled_cond_labels, + 'cond_labels': cond_labels, + + 'tgt_seq_ids': tgt_seq_ids, + 'generated': generated, + 'at_generated': at_generated, + 'cg_generated': cg_generated, + + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + 'acc_ge_cls': acc_ge_cls, + 'acc_at_cls': acc_at_cls, + 'acc_cg_cls': acc_cg_cls, + + 'pred_cls': pred_cls, + 'pred_ge_cls': pred_ge_cls, + 'pred_at_cls': pred_at_cls, + 'pred_cg_cls': pred_cg_cls, + } + + return result + + loss_dict = { + 'loss': loss, + 'loss_rec': loss_rec, + 'loss_encoder': loss_encoder, + 'loss_lsc': loss_lsc, + 'loss_lsd': loss_lsd, + 'loss_lsg': loss_lsg, + 'loss_cls': loss_cls, + # 'loss_at_soft_cls': loss_at_soft_cls, + } + acc_dict = { + 'acc_encode_z_dis': acc_encode_z_dis, + 'acc_gen_z_dis': acc_gen_z_dis, + 'acc_encode_z_cls': acc_encode_z_cls, + 'acc_cls': acc_cls, + # 'acc_at_soft_cls': acc_at_soft_cls, + } + return loss_dict, acc_dict + + def sample_sequence_conditional_batch(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device) + context = context.unsqueeze(0).repeat(num_samples, 1) + generated = context # (B, 1) + + # with torch.no_grad(): + while generated.size(-1) < self.args.block_size: + inputs = {'input_ids': generated, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] + + # softmax sample + next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + filtered_logits = F.softmax(filtered_logits, dim=-1) + next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + 
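+ # Note: generation stops early only if every sequence in the batch emits the stop token at the same step;
+ # otherwise decoding continues until args.block_size tokens have been produced.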
if torch.sum(not_finished) == 0: + break + + return generated # (B, seq_len) + + def top_k_top_p_filtering_batch(self, logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): + """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering + Args: + logits: logits distribution shape (vocabulary size) + top_k > 0: keep only top k tokens with highest probability (top-k filtering). + top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering). + Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751) + From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317 + """ + # assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear + + top_k = min(top_k, logits.size(-1)) # Safety check + + if top_k > 0: + # Remove all tokens with a probability less than the last token of the top-k + threshold = torch.topk(logits, top_k, dim=-1)[0][:, -1, None] + logits.masked_fill_(logits < threshold, filter_value) # (B, vocab_size) + + if top_p > 0.0: + sorted_logits, sorted_indices = torch.sort(logits, descending=True) # (B, vocab_size) + cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) # (B, vocab_size) + + # Remove tokens with cumulative probability above the threshold + sorted_indices_to_remove = cumulative_probs > top_p + + # Shift the indices to the right to keep also the first token above the threshold + sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone() + sorted_indices_to_remove[..., 0] = 0 + + indices_to_remove = sorted_indices[sorted_indices_to_remove] + + logits.masked_fill_(indices_to_remove, filter_value) + + return logits + + def sample_sequence_conditional_batch_soft(self, past, context): + # context: a single id of + # past: (B, past_seq_len dim_h) + num_samples = past.size(0) + context = torch.tensor(context, dtype=torch.long, device=past.device).unsqueeze(0).repeat(num_samples, 1) # (B, 1) + context_soft = torch.FloatTensor(num_samples, self.decoder.config.vocab_size).zero_().to(device=past.device) # (B, vocab_size) + context_soft.scatter_(1, context, 1) # (B, vocab_size) + generated_soft = context_soft.unsqueeze(1) # (B, 1, vocab_size) + + # with torch.no_grad(): + while generated_soft.size(1) < self.args.block_size: # generated_soft: (B, seq_len, vocab_size) + inputs = {'soft_ids': generated_soft, 'past': past} + outputs = self.decoder(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states) + lm_logits = outputs[0] # (B, seq_len, vocab_size) + + # Gumbel softmax sample + next_tokens_soft = gumbel_softmax(logits=lm_logits[:, -1:, :], temperature=self.args.soft_temperature, hard=False) # (B, 1, vocab_size) + generated_soft = torch.cat((generated_soft, next_tokens_soft), dim=1) # (B, seq_len+1, vocab_size) + + # # softmax sample + # next_tokens_logits = lm_logits[:, -1, :] / self.args.temperature # (B, 1, vocab_size) + # filtered_logits = self.top_k_top_p_filtering_batch(next_tokens_logits, top_k=self.args.top_k, top_p=self.args.top_p) # (B, 1, vocab_size) + # filtered_logits = F.softmax(filtered_logits, dim=-1) + # next_tokens = torch.multinomial(filtered_logits, num_samples=1) # (B, 1) + # generated = torch.cat((generated, next_tokens), dim=1) # (B, seq_len+1) + + next_tokens = torch.argmax(next_tokens_soft, dim=-1) # (B, 1) + not_finished = next_tokens != self.tokenizer_decoder.encode('')[0] + if torch.sum(not_finished) == 0: + break + + return generated_soft # (B, 
seq_len, vocab_size) + + +### Gumbel Softmax +def gumbel_softmax(logits, temperature, hard=False): + """Sample from the Gumbel-Softmax distribution and optionally discretize. + Args: + logits: [..., n_class] unnormalized log-probs + temperature: non-negative scalar + hard: if True, take argmax, but differentiate w.r.t. soft sample y + Returns: + [..., n_class] sample from the Gumbel-Softmax distribution. + If hard=True, then the returned sample will be one-hot, otherwise it will be a probabilitiy distribution that sums to 1 across classes + """ + y = gumbel_softmax_sample(logits, temperature) # (..., n_class) + + if hard: # return onehot + shape = y.size() + _, ind = y.max(dim=-1) + y_hard = torch.zeros_like(y).view(-1, shape[-1]) + y_hard.scatter_(1, ind.view(-1, 1), 1) # one hot + y_hard = y_hard.view(*shape) + # Set gradients w.r.t. y_hard gradients w.r.t. y + y = (y_hard - y).detach() + y + + return y # (..., n_class) + +from torch.nn import functional as F +def gumbel_softmax_sample(logits, temperature): + y = logits + sample_gumbel(logits.size(), logits.device) + return F.softmax(y / temperature, dim=-1) + +def sample_gumbel(shape, device, eps=1e-20): + U = torch.rand(shape).to(device=device) + return -torch.log(-torch.log(U + eps) + eps) diff --git a/Optimus/code/modules/decoders/dec_gpt2.py b/Optimus/code/modules/decoders/dec_gpt2.py new file mode 100755 index 0000000000000000000000000000000000000000..9e1a725291a1883d8946f935467f73d3239fd4f0 --- /dev/null +++ b/Optimus/code/modules/decoders/dec_gpt2.py @@ -0,0 +1,358 @@ +# import torch + +import time +import argparse + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence + +import numpy as np + +from .decoder import DecoderBase + +class LSTMDecoder(DecoderBase): + """LSTM decoder with constant-length data""" + def __init__(self, args, vocab, model_init, emb_init): + super(LSTMDecoder, self).__init__() + self.ni = args.ni + self.nh = args.dec_nh + self.nz = args.nz + self.vocab = vocab + self.device = args.device + + # no padding when setting padding_idx to -1 + self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=-1) + + self.dropout_in = nn.Dropout(args.dec_dropout_in) + self.dropout_out = nn.Dropout(args.dec_dropout_out) + + # for initializing hidden state and cell + self.trans_linear = nn.Linear(args.nz, args.dec_nh, bias=False) + + # concatenate z with input + self.lstm = nn.LSTM(input_size=args.ni + args.nz, + hidden_size=args.dec_nh, + num_layers=1, + batch_first=True) + + # prediction layer + self.pred_linear = nn.Linear(args.dec_nh, len(vocab), bias=False) + + vocab_mask = torch.ones(len(vocab)) + # vocab_mask[vocab['']] = 0 + self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduce=False) + + self.reset_parameters(model_init, emb_init) + + def reset_parameters(self, model_init, emb_init): + # for name, param in self.lstm.named_parameters(): + # # self.initializer(param) + # if 'bias' in name: + # nn.init.constant_(param, 0.0) + # # model_init(param) + # elif 'weight' in name: + # model_init(param) + + # model_init(self.trans_linear.weight) + # model_init(self.pred_linear.weight) + for param in self.parameters(): + model_init(param) + emb_init(self.embed.weight) + + def sample_text(self, input, z, EOS, device): + sentence = [input] + max_index = 0 + + input_word = input + batch_size, n_sample, _ = z.size() + seq_len = 1 + z_ = z.expand(batch_size, seq_len, self.nz) + seq_len = input.size(1) + softmax = torch.nn.Softmax(dim=0) 
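+ # Sampling loop: embed the previous token, concatenate it with z, and run the LSTM (its state is initialized
+ # from z via trans_linear on the first step); the next token is drawn from the softmax over the vocabulary,
+ # stopping once the end symbol is sampled or the sentence reaches 100 tokens.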
+ while max_index != EOS and len(sentence) < 100: + # (batch_size, seq_len, ni) + word_embed = self.embed(input_word) + word_embed = torch.cat((word_embed, z_), -1) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + if len(sentence) == 1: + h_init = h_init.squeeze(dim=1) + c_init = c_init.squeeze(dim=1) + output, hidden = self.lstm.forward(word_embed, (h_init, c_init)) + else: + output, hidden = self.lstm.forward(word_embed, hidden) + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + output_logits = output_logits.view(-1) + probs = softmax(output_logits) + # max_index = torch.argmax(output_logits) + max_index = torch.multinomial(probs, num_samples=1) + input_word = torch.tensor([[max_index]]).to(device) + sentence.append(max_index) + return sentence + + def decode(self, input, z): + """ + Args: + input: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + """ + + # not predicting start symbol + # sents_len -= 1 + + batch_size, n_sample, _ = z.size() + seq_len = input.size(1) + + # (batch_size, seq_len, ni) + word_embed = self.embed(input) + word_embed = self.dropout_in(word_embed) + + if n_sample == 1: + z_ = z.expand(batch_size, seq_len, self.nz) + + else: + word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \ + .contiguous() + + # (batch_size * n_sample, seq_len, ni) + word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni) + + z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous() + z_ = z_.view(batch_size * n_sample, seq_len, self.nz) + + # (batch_size * n_sample, seq_len, ni + nz) + word_embed = torch.cat((word_embed, z_), -1) + + z = z.view(batch_size * n_sample, self.nz) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + # h_init = self.trans_linear(z).unsqueeze(0) + # c_init = h_init.new_zeros(h_init.size()) + output, _ = self.lstm(word_embed, (h_init, c_init)) + + output = self.dropout_out(output) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + + return output_logits + + def reconstruct_error(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + #remove end symbol + src = x[:, :-1] + + # remove start symbol + tgt = x[:, 1:] + + batch_size, seq_len = src.size() + n_sample = z.size(1) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.decode(src, z) + + if n_sample == 1: + tgt = tgt.contiguous().view(-1) + else: + # (batch_size * n_sample * seq_len) + tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \ + .contiguous().view(-1) + + # (batch_size * n_sample * seq_len) + loss = self.loss(output_logits.view(-1, output_logits.size(2)), + tgt) + + + # (batch_size, n_sample) + return loss.view(batch_size, n_sample, -1).sum(-1) + + + def log_probability(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). 
+ log_p(x|z) across different x and z + """ + + return -self.reconstruct_error(x, z) + + + + + def greedy_decode(self, z): + return self.sample_decode(z, greedy=True) + + def sample_decode(self, z, greedy=False): + """sample/greedy decoding from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + batch_size = z.size(0) + decoded_batch = [[] for _ in range(batch_size)] + + # (batch_size, 1, nz) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + + decoder_hidden = (h_init, c_init) + decoder_input = torch.tensor([self.vocab[""]] * batch_size, dtype=torch.long, device=self.device).unsqueeze(1) + end_symbol = torch.tensor([self.vocab[""]] * batch_size, dtype=torch.long, device=self.device) + + mask = torch.ones((batch_size), dtype=torch.uint8, device=self.device) + length_c = 1 + while mask.sum().item() != 0 and length_c < 100: + + # (batch_size, 1, ni) --> (batch_size, 1, ni+nz) + word_embed = self.embed(decoder_input) + word_embed = torch.cat((word_embed, z.unsqueeze(1)), dim=-1) + + output, decoder_hidden = self.lstm(word_embed, decoder_hidden) + + # (batch_size, 1, vocab_size) --> (batch_size, vocab_size) + decoder_output = self.pred_linear(output) + output_logits = decoder_output.squeeze(1) + + # (batch_size) + if greedy: + max_index = torch.argmax(output_logits, dim=1) + else: + probs = F.softmax(output_logits, dim=1) + max_index = torch.multinomial(probs, num_samples=1).squeeze(1) + + decoder_input = max_index.unsqueeze(1) + length_c += 1 + + for i in range(batch_size): + word = self.vocab.id2word(max_index[i].item()) + if mask[i].item(): + decoded_batch[i].append(self.vocab.id2word(max_index[i].item())) + + mask = torch.mul((max_index != end_symbol), mask) + + return decoded_batch + +class VarLSTMDecoder(LSTMDecoder): + """LSTM decoder with constant-length data""" + def __init__(self, args, vocab, model_init, emb_init): + super(VarLSTMDecoder, self).__init__(args, vocab, model_init, emb_init) + + self.embed = nn.Embedding(len(vocab), args.ni, padding_idx=vocab['']) + vocab_mask = torch.ones(len(vocab)) + vocab_mask[vocab['']] = 0 + self.loss = nn.CrossEntropyLoss(weight=vocab_mask, reduce=False) + + self.reset_parameters(model_init, emb_init) + + def decode(self, input, z): + """ + Args: + input: tuple which contains x and sents_len + x: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + z: (batch_size, n_sample, nz) + """ + + input, sents_len = input + + # not predicting start symbol + sents_len = sents_len - 1 + + batch_size, n_sample, _ = z.size() + seq_len = input.size(1) + + # (batch_size, seq_len, ni) + word_embed = self.embed(input) + word_embed = self.dropout_in(word_embed) + + if n_sample == 1: + z_ = z.expand(batch_size, seq_len, self.nz) + + else: + word_embed = word_embed.unsqueeze(1).expand(batch_size, n_sample, seq_len, self.ni) \ + .contiguous() + + # (batch_size * n_sample, seq_len, ni) + word_embed = word_embed.view(batch_size * n_sample, seq_len, self.ni) + + z_ = z.unsqueeze(2).expand(batch_size, n_sample, seq_len, self.nz).contiguous() + z_ = z_.view(batch_size * n_sample, seq_len, self.nz) + + # (batch_size * n_sample, seq_len, ni + nz) + word_embed = torch.cat((word_embed, z_), -1) + + sents_len = sents_len.unsqueeze(1).expand(batch_size, n_sample).contiguous().view(-1) + packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True) + + z = z.view(batch_size * n_sample, self.nz) + # h_init = self.trans_linear(z).unsqueeze(0) + # c_init = 
h_init.new_zeros(h_init.size()) + c_init = self.trans_linear(z).unsqueeze(0) + h_init = torch.tanh(c_init) + output, _ = self.lstm(packed_embed, (h_init, c_init)) + output, _ = pad_packed_sequence(output, batch_first=True) + + output = self.dropout_out(output) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.pred_linear(output) + + return output_logits + + def reconstruct_error(self, x, z): + """Cross Entropy in the language case + Args: + x: tuple which contains x_ and sents_len + x_: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + x, sents_len = x + + #remove end symbol + src = x[:, :-1] + + # remove start symbol + tgt = x[:, 1:] + + batch_size, seq_len = src.size() + n_sample = z.size(1) + + # (batch_size * n_sample, seq_len, vocab_size) + output_logits = self.decode((src, sents_len), z) + + if n_sample == 1: + tgt = tgt.contiguous().view(-1) + else: + # (batch_size * n_sample * seq_len) + tgt = tgt.unsqueeze(1).expand(batch_size, n_sample, seq_len) \ + .contiguous().view(-1) + + # (batch_size * n_sample * seq_len) + loss = self.loss(output_logits.view(-1, output_logits.size(2)), + tgt) + + + # (batch_size, n_sample) + return loss.view(batch_size, n_sample, -1).sum(-1) \ No newline at end of file diff --git a/Optimus/code/modules/decoders/decoder.py b/Optimus/code/modules/decoders/decoder.py new file mode 100755 index 0000000000000000000000000000000000000000..da75beb16da7e929f04c5178336096ecc6e7facf --- /dev/null +++ b/Optimus/code/modules/decoders/decoder.py @@ -0,0 +1,79 @@ +import torch +import torch.nn as nn + + +class DecoderBase(nn.Module): + """docstring for Decoder""" + def __init__(self): + super(DecoderBase, self).__init__() + + + def freeze(self): + for param in self.parameters(): + param.requires_grad = False + + def decode(self, x, z): + """ + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: Tensor1 + Tensor1: the output logits with size (batch_size * n_sample, seq_len, vocab_size) + """ + + raise NotImplementedError + + def reconstruct_error(self, x, z): + """reconstruction loss + Args: + x: (batch_size, *) + z: (batch_size, n_sample, nz) + Returns: + loss: (batch_size, n_sample). Loss + across different sentence and z + """ + + raise NotImplementedError + + def beam_search_decode(self, z, K): + """beam search decoding + Args: + z: (batch_size, nz) + K: the beam size + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def sample_decode(self, z): + """sampling from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def greedy_decode(self, z): + """greedy decoding from z + Args: + z: (batch_size, nz) + Returns: List1 + List1: the decoded word sentence list + """ + + raise NotImplementedError + + def log_probability(self, x, z): + """ + Args: + x: (batch_size, *) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). 
+ log_p(x|z) across different x and z + """ + + raise NotImplementedError \ No newline at end of file diff --git a/Optimus/code/modules/encoders/__init__.py b/Optimus/code/modules/encoders/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..8b63707c81d4f1872b1d02baac891c8ac40b32f8 --- /dev/null +++ b/Optimus/code/modules/encoders/__init__.py @@ -0,0 +1 @@ +from .enc_lstm import * \ No newline at end of file diff --git a/Optimus/code/modules/encoders/__pycache__/__init__.cpython-310.pyc b/Optimus/code/modules/encoders/__pycache__/__init__.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..4064121b21f11a35b14727a38ed86d634b52bd08 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/__init__.cpython-310.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/__init__.cpython-37.pyc b/Optimus/code/modules/encoders/__pycache__/__init__.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..4737351fdf380d6fe763085321c130ab5ccc50a1 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/__init__.cpython-37.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-310.pyc b/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..11901ae2150a18d07dde6b8d9f09bdcee4969ed7 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-310.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-37.pyc b/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..43dc40a23ef3bf3ab151af453f1b8963c843e0d3 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/enc_lstm.cpython-37.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/encoder.cpython-310.pyc b/Optimus/code/modules/encoders/__pycache__/encoder.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..db78af3c55aa563f0a49a886ee477f197cdeb72b Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/encoder.cpython-310.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/encoder.cpython-37.pyc b/Optimus/code/modules/encoders/__pycache__/encoder.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..fc6e332d549a000245693d6ba8a71a51803ff2a5 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/encoder.cpython-37.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-310.pyc b/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-310.pyc new file mode 100644 index 0000000000000000000000000000000000000000..a6670655be4bfab75c85de8c1cd82bfc93073c63 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-310.pyc differ diff --git a/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-37.pyc b/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..d7d78d29167610f4ac3263144b7ea1e131baee43 Binary files /dev/null and b/Optimus/code/modules/encoders/__pycache__/gaussian_encoder.cpython-37.pyc differ diff --git a/Optimus/code/modules/encoders/enc_lstm.py b/Optimus/code/modules/encoders/enc_lstm.py new file mode 100755 index 
0000000000000000000000000000000000000000..3fe5a1a342bf5d823dc9f43141cad2a7a80f6ee7 --- /dev/null +++ b/Optimus/code/modules/encoders/enc_lstm.py @@ -0,0 +1,126 @@ +from itertools import chain +import math +import torch +import torch.nn as nn + +from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence +from .gaussian_encoder import GaussianEncoderBase +from ..utils import log_sum_exp + +class GaussianLSTMEncoder(GaussianEncoderBase): + """Gaussian LSTM Encoder with constant-length input""" + def __init__(self, args, vocab_size, model_init, emb_init): + super(GaussianLSTMEncoder, self).__init__() + self.ni = args.ni + self.nh = args.enc_nh + self.nz = args.nz + self.args = args + + self.embed = nn.Embedding(vocab_size, args.ni) + + self.lstm = nn.LSTM(input_size=args.ni, + hidden_size=args.enc_nh, + num_layers=1, + batch_first=True, + dropout=0) + + self.linear = nn.Linear(args.enc_nh, 2 * args.nz, bias=False) + + self.reset_parameters(model_init, emb_init) + + def reset_parameters(self, model_init, emb_init): + # for name, param in self.lstm.named_parameters(): + # # self.initializer(param) + # if 'bias' in name: + # nn.init.constant_(param, 0.0) + # # model_init(param) + # elif 'weight' in name: + # model_init(param) + + # model_init(self.linear.weight) + # emb_init(self.embed.weight) + for param in self.parameters(): + model_init(param) + emb_init(self.embed.weight) + + + def forward(self, input): + """ + Args: + x: (batch_size, seq_len) + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + # (batch_size, seq_len-1, args.ni) + word_embed = self.embed(input) + + _, (last_state, last_cell) = self.lstm(word_embed) + + mean, logvar = self.linear(last_state).chunk(2, -1) + + # fix variance as a pre-defined value + if self.args.fix_var > 0: + logvar = mean.new_tensor([[[math.log(self.args.fix_var)]]]).expand_as(mean) + + return mean.squeeze(0), logvar.squeeze(0) + + # def eval_inference_mode(self, x): + # """compute the mode points in the inference distribution + # (in Gaussian case) + # Returns: Tensor + # Tensor: the posterior mode points with shape (*, nz) + # """ + + # # (batch_size, nz) + # mu, logvar = self.forward(x) + + +class VarLSTMEncoder(GaussianLSTMEncoder): + """Gaussian LSTM Encoder with variable-length input""" + def __init__(self, args, vocab_size, model_init, emb_init): + super(VarLSTMEncoder, self).__init__(args, vocab_size, model_init, emb_init) + + + def forward(self, input): + """ + Args: + input: tuple which contains x and sents_len + x: (batch_size, seq_len) + sents_len: long tensor of sentence lengths + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + input, sents_len = input + # (batch_size, seq_len, args.ni) + word_embed = self.embed(input) + + packed_embed = pack_padded_sequence(word_embed, sents_len.tolist(), batch_first=True) + + _, (last_state, last_cell) = self.lstm(packed_embed) + + mean, logvar = self.linear(last_state).chunk(2, -1) + + return mean.squeeze(0), logvar.squeeze(0) + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Args: + input: tuple which contains x and sents_len + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, 
logvar, nsamples) + + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL diff --git a/Optimus/code/modules/encoders/encoder.py b/Optimus/code/modules/encoders/encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..6daed22c92648923eb90a1f49d91a07f75d63262 --- /dev/null +++ b/Optimus/code/modules/encoders/encoder.py @@ -0,0 +1,58 @@ +import math +import torch +import torch.nn as nn + +from ..utils import log_sum_exp + +class EncoderBase(nn.Module): + """docstring for EncoderBase""" + def __init__(self): + super(EncoderBase, self).__init__() + + def forward(self, x): + """ + Args: + x: (batch_size, *) + Returns: the tensors required to parameterize a distribution. + E.g. for Gaussian encoder it returns the mean and variance tensors + """ + + raise NotImplementedError + + def sample(self, input, nsamples): + """sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + raise NotImplementedError + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + raise NotImplementedError + + + def eval_inference_dist(self, x, z, param=None): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + raise NotImplementedError + + def calc_mi(self, x): + """Approximate the mutual information between x and z + I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z)) + Returns: Float + """ + + raise NotImplementedError \ No newline at end of file diff --git a/Optimus/code/modules/encoders/gaussian_encoder.py b/Optimus/code/modules/encoders/gaussian_encoder.py new file mode 100755 index 0000000000000000000000000000000000000000..1b97e7eec85a7d4fcf064da1c90bbc07e8b97073 --- /dev/null +++ b/Optimus/code/modules/encoders/gaussian_encoder.py @@ -0,0 +1,147 @@ +import math +import torch +import torch.nn as nn + +from .encoder import EncoderBase +from ..utils import log_sum_exp + +class GaussianEncoderBase(EncoderBase): + """docstring for EncoderBase""" + def __init__(self): + super(GaussianEncoderBase, self).__init__() + + def freeze(self): + for param in self.parameters(): + param.requires_grad = False + + def forward(self, x): + """ + Args: + x: (batch_size, *) + Returns: Tensor1, Tensor2 + Tensor1: the mean tensor, shape (batch, nz) + Tensor2: the logvar tensor, shape (batch, nz) + """ + + raise NotImplementedError + + def encode_stats(self, x): + + return self.forward(x) + + def sample(self, input, nsamples): + """sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + return z, (mu, logvar) + + def encode(self, input, nsamples): + """perform the encoding and compute the KL term + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + mu, logvar = self.forward(input) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + def 
reparameterize(self, mu, logvar, nsamples=1): + """sample from posterior Gaussian family + Args: + mu: Tensor + Mean of gaussian distribution with shape (batch, nz) + logvar: Tensor + logvar of gaussian distibution with shape (batch, nz) + Returns: Tensor + Sampled z with shape (batch, nsamples, nz) + """ + batch_size, nz = mu.size() + std = logvar.mul(0.5).exp() + + mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz) + std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz) + + eps = torch.zeros_like(std_expd).normal_() + + return mu_expd + torch.mul(eps, std_expd) + + def eval_inference_dist(self, x, z, param=None): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + nz = z.size(2) + + if not param: + mu, logvar = self.forward(x) + else: + mu, logvar = param + + # (batch_size, 1, nz) + mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1) + var = logvar.exp() + + # (batch_size, nsamples, nz) + dev = z - mu + + # (batch_size, nsamples) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + return log_density + + + + def calc_mi(self, x): + """Approximate the mutual information between x and z + I(x, z) = E_xE_{q(z|x)}log(q(z|x)) - E_xE_{q(z|x)}log(q(z)) + Returns: Float + """ + + # [x_batch, nz] + mu, logvar = self.forward(x) + + x_batch, nz = mu.size() + + # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1) + neg_entropy = (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).mean() + + # [z_batch, 1, nz] + z_samples = self.reparameterize(mu, logvar, 1) + + # [1, x_batch, nz] + mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0) + var = logvar.exp() + + # (z_batch, x_batch, nz) + dev = z_samples - mu + + # (z_batch, x_batch) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + # log q(z): aggregate posterior + # [z_batch] + log_qz = log_sum_exp(log_density, dim=1) - math.log(x_batch) + + return (neg_entropy - log_qz.mean(-1)).item() \ No newline at end of file diff --git a/Optimus/code/modules/spacefusion.py b/Optimus/code/modules/spacefusion.py new file mode 100755 index 0000000000000000000000000000000000000000..bacfd96016853c56ddf1774c37b238b6be4737a3 --- /dev/null +++ b/Optimus/code/modules/spacefusion.py @@ -0,0 +1,143 @@ +from .vae import VAE +import numpy as np +import torch, copy, pdb +import torch.nn.functional as F + +from torch import nn + +import pdb + + +def set_trainable(module, value): + for param in module.parameters(): + param.requires_grad = value + +class SpaceFusion(VAE): + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): + super(SpaceFusion, self).__init__(encoder, decoder, tokenizer_encoder, tokenizer_decoder, args) + children = [v for v in encoder.encoder.layer.children()] # list of 12 BertLayer + + self.num_s2s_bert_layer = args.num_s2s_bert_layer + self.S2S_layers = nn.ModuleList([copy.deepcopy(c) for c in children[-args.num_s2s_bert_layer:] ]) # the last layer of encoder + self.S2S_pooler = copy.deepcopy(encoder.pooler) + self.ix_turn_sep = tokenizer_encoder.convert_tokens_to_ids('[SEP]') + if args.freeze_bert: + print('@'*20 + f' freezing BERT {args.num_frozen_bert_layer} layers') + for child in children[:args.num_frozen_bert_layer]: + set_trainable(child, False) + + + + def ids2speaker(self, ids): + # 0 for 
speaker A, 1 for speaker B + N, T = ids.shape + speaker = np.zeros((N, T)) + sep = ids == self.ix_turn_sep + for i in range(N): + is_B = False # start with speaker A + for t in range(T): + speaker[i,t] = int(is_B) + if sep[i,t].item(): + is_B = not is_B + + # make sure the final speaker is speaker B (so response is always speaker A) + if not is_B: + speaker = 1 - speaker + + return torch.LongTensor(speaker).to(ids.device) + + def forward(self, inputs_src, inputs_tgt, labels_tgt, return_vec=False): # [batch, time] + # toggle config to get desired encoder output + self.encoder.encoder.output_attentions = False + self.encoder.encoder.output_hidden_states = True + + + # AE encoder + mask = (inputs_tgt > 0).float().to(inputs_src.device) + outputs = self.encoder(inputs_tgt, attention_mask=mask) + z_AE, _ = self.connect(outputs[1]) + z_AE = z_AE.squeeze(1) + + # S2S encoder + mask = (inputs_src > 0).float() + speaker = self.ids2speaker(inputs_src) + outputs = self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker) + _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs + seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 () + + for s2s in self.S2S_layers: + layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + seq_z_prev = layer_outputs[0] + + z_S2S = self.encoder.pooler(layer_outputs[0]) + z_S2S, _ = self.connect(z_S2S) + z_S2S = z_S2S.squeeze(1) + + if return_vec: + return z_AE, z_S2S + + # interpolation/smoothness + u = torch.FloatTensor(np.random.random((z_AE.shape[0], 1))).to(inputs_tgt.device) + z_interp = u * z_AE + (1 - u) * z_S2S + std = 0.1 + noise = torch.FloatTensor(np.random.normal(size=z_interp.shape) * std).to(z_interp.device) + z_interp = z_interp + noise + + loss_rec = 0 + z_idx = 0 + for z in [z_AE, z_S2S, z_interp]: + #pdb.set_trace() + past = z # past = self.decoder.linear(z) + outputs = self.decoder(input_ids=labels_tgt, past=past, labels=labels_tgt, label_ignore=self.pad_token_id) + if z_idx == 1: + loss_rec = loss_rec + 1.0 * outputs[0] + else: + loss_rec = loss_rec + outputs[0] + z_idx += 1 + loss_rec = loss_rec/3 + + # fusion/regularization + L_pull = self.dist_pair(z_AE, z_S2S) + L_push = torch.stack([self.dist_batch(z) for z in [z_AE, z_S2S]]).min() + loss_reg = (L_pull - L_push * 2) / np.sqrt(z.shape[-1]) + + loss = loss_rec + self.args.beta * loss_reg + return loss_rec, loss_reg, loss + + def sent2latent(self, inputs_src): + # toggle config to get desired encoder output + self.encoder.encoder.output_attentions = False + self.encoder.encoder.output_hidden_states = True + + # S2S encoder + mask = (inputs_src > 0).float() + speaker = self.ids2speaker(inputs_src) + outputs = self.encoder(inputs_src, attention_mask=mask, token_type_ids=speaker) + + _, _, all_layer_attn = outputs # last_layer_attn, pooled, all_layer_attn = outputs + # seq_z_prev = all_layer_attn[-2] # seq of z at layer 11 () + # layer_outputs = self.S2S_layer(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + + seq_z_prev = all_layer_attn[-self.num_s2s_bert_layer-1] # seq of z at layer 11 () + for s2s in self.S2S_layers: + layer_outputs = s2s(seq_z_prev, attention_mask=mask.unsqueeze(1).unsqueeze(1)) + seq_z_prev = layer_outputs[0] + + z_S2S = self.encoder.pooler(layer_outputs[0]) + z_S2S, _ = self.connect(z_S2S) + z_S2S = z_S2S.squeeze(1) + + return z_S2S + + + def dist_pair(self, a, b): + return F.pairwise_distance(a, b).mean() + + + def dist_batch(self, vec): + n = vec.shape[0] + dmin = [] + for i 
in range(n): + dd = F.pairwise_distance(vec[i:i+1,:].repeat(n,1), vec) + dmin.append(dd.min()) + return torch.stack(dmin).mean() \ No newline at end of file diff --git a/Optimus/code/modules/utils.py b/Optimus/code/modules/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..57afd02c2d43e895143569a0f29e431043510409 --- /dev/null +++ b/Optimus/code/modules/utils.py @@ -0,0 +1,40 @@ +import torch + +def safe_log(z): + return torch.log(z + 1e-7) + +def log_sum_exp(value, dim=None, keepdim=False): + """Numerically stable implementation of the operation + value.exp().sum(dim, keepdim).log() + """ + if dim is not None: + m, _ = torch.max(value, dim=dim, keepdim=True) + value0 = value - m + if keepdim is False: + m = m.squeeze(dim) + return m + torch.log(torch.sum(torch.exp(value0), dim=dim, keepdim=keepdim)) + else: + m = torch.max(value) + sum_exp = torch.sum(torch.exp(value - m)) + return m + torch.log(sum_exp) + + +def generate_grid(zmin, zmax, dz, device, ndim=2): + """generate a 1- or 2-dimensional grid + Returns: Tensor, int + Tensor: The grid tensor with shape (k^2, 2), + where k=(zmax - zmin)/dz + int: k + """ + + if ndim == 2: + x = torch.arange(zmin, zmax, dz) + k = x.size(0) + + x1 = x.unsqueeze(1).repeat(1, k).view(-1) + x2 = x.repeat(k) + + return torch.cat((x1.unsqueeze(-1), x2.unsqueeze(-1)), dim=-1).to(device), k + + elif ndim == 1: + return torch.arange(zmin, zmax, dz).unsqueeze(1).to(device) \ No newline at end of file diff --git a/Optimus/code/modules/vae.py b/Optimus/code/modules/vae.py new file mode 100755 index 0000000000000000000000000000000000000000..e3e697383556b455ba9a247a51113d281c0cb8cd --- /dev/null +++ b/Optimus/code/modules/vae.py @@ -0,0 +1,638 @@ +import math +import torch +import torch.nn as nn + +from .utils import log_sum_exp + +import pdb + +import logging +logger = logging.getLogger(__name__) + + +class VAE(nn.Module): + """VAE with normal prior""" + def __init__(self, encoder, decoder, tokenizer_encoder, tokenizer_decoder, args): # + super(VAE, self).__init__() + self.encoder = encoder + self.decoder = decoder + + self.args = args + self.nz = args.latent_size + + self.eos_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.eos_token])[0] + self.pad_token_id = tokenizer_decoder.convert_tokens_to_ids([tokenizer_decoder.pad_token])[0] + + + # connector: from Bert hidden units to the latent space + # self.linear = nn.Linear(args.nz, 2 * args.nz, bias=False) + + # Standard Normal prior + loc = torch.zeros(self.nz, device=args.device) + scale = torch.ones(self.nz, device=args.device) + self.prior = torch.distributions.normal.Normal(loc, scale) + + def connect(self, bert_fea, nsamples=1): + """ + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + # pdb.set_trace() + # mean, logvar = mean.squeeze(0), logvar.squeeze(0) + + # (batch, nsamples, nz) + z = self.reparameterize(mean, logvar, nsamples) + KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + def connect_deterministic(self, bert_fea, nsamples=1): + """ + Returns: Tensor1, Tensor2 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + Tensor2: the tenor of KL for each x with shape [batch] + """ + + # (batch_size, nz) + + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + # pdb.set_trace() + # mean, logvar = mean.squeeze(0), 
logvar.squeeze(0) + + logvar.fill_(.0) + # (batch, nsamples, nz) + z = self.reparameterize(mean, logvar, nsamples) + KL = 0.5 * (mean.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + return z, KL + + + + def reparameterize(self, mu, logvar, nsamples=1): + """sample from posterior Gaussian family + Args: + mu: Tensor + Mean of gaussian distribution with shape (batch, nz) + logvar: Tensor + logvar of gaussian distibution with shape (batch, nz) + Returns: Tensor + Sampled z with shape (batch, nsamples, nz) + """ + batch_size, nz = mu.size() + std = logvar.mul(0.5).exp() + + mu_expd = mu.unsqueeze(1).expand(batch_size, nsamples, nz) + std_expd = std.unsqueeze(1).expand(batch_size, nsamples, nz) + + eps = torch.zeros_like(std_expd).normal_() + + return mu_expd + torch.mul(eps, std_expd) + + def forward(self, inputs, labels): + + # pdb.set_trace() + + attention_mask=(inputs > 0).float() + # logger.info(inputs) + # logger.info(attention_mask) + # logger.info(labels) + reconstrution_mask=(labels != 50257).float() # 50257 is the padding token for GPT2 + sent_length = torch.sum(reconstrution_mask, dim=1) + + + outputs = self.encoder(inputs, attention_mask) + pooled_hidden_fea = outputs[1] # model outputs are always tuple in pytorch-transformers (see doc) + + if self.args.fb_mode==0: + # Connect hidden feature to the latent space + latent_z, loss_kl = self.connect(pooled_hidden_fea) + latent_z = latent_z.squeeze(1) + + + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + elif self.args.fb_mode==1: + # Connect hidden feature to the latent space + mu, logvar = self.encoder.linear(pooled_hidden_fea).chunk(2, -1) + latent_z = self.reparameterize(mu, logvar, nsamples=1) + latent_z = latent_z.squeeze(1) + loss_kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1) + kl_mask = (loss_kl > self.args.dim_target_kl).float() + loss_kl = (kl_mask * loss_kl).sum(dim=1) + + # pdb.set_trace() + # past = self.decoder.linear(latent_z) + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + elif self.args.fb_mode==2: + # Connect hidden feature to the latent space + latent_z, loss_kl = self.connect_deterministic(pooled_hidden_fea) + latent_z = latent_z.squeeze(1) + + # past = self.decoder.linear(latent_z) + # Decoding + outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id) + loss_rec = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc) + + + # pdb.set_trace() + if self.args.length_weighted_loss: + loss = loss_rec / sent_length + self.args.beta * loss_kl + else: + loss = loss_rec + self.args.beta * loss_kl + + + return loss_rec, loss_kl, loss + + + + def encoder_sample(self, bert_fea, nsamples): + """sampling from the encoder + Returns: Tensor1 + Tensor1: the tensor latent z with shape [batch, nsamples, nz] + """ + + # (batch_size, nz) + + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + mu, logvar = mu.squeeze(0), logvar.squeeze(0) + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, nsamples) + + return z, (mu, logvar) + + + def encode_stats(self, x): + """ + Returns: Tensor1, Tensor2 + Tensor1: the mean of latent z with shape [batch, nz] + Tensor2: the logvar of latent z with shape [batch, nz] + """ + + return 
self.encoder.encode_stats(x) + + def decode(self, z, strategy, K=10): + """generate samples from z given strategy + Args: + z: [batch, nsamples, nz] + strategy: "beam" or "greedy" or "sample" + K: the beam width parameter + Returns: List1 + List1: a list of decoded word sequence + """ + + if strategy == "beam": + return self.decoder.beam_search_decode(z, K) + elif strategy == "greedy": + return self.decoder.greedy_decode(z) + elif strategy == "sample": + return self.decoder.sample_decode(z) + else: + raise ValueError("the decoding strategy is not supported") + + + def reconstruct(self, x, decoding_strategy="greedy", K=5): + """reconstruct from input x + Args: + x: (batch, *) + decoding_strategy: "beam" or "greedy" or "sample" + K: the beam width parameter + Returns: List1 + List1: a list of decoded word sequence + """ + z = self.sample_from_inference(x).squeeze(1) + + return self.decode(z, decoding_strategy, K) + + def log_probability(self, x, z): + """Cross Entropy in the language case + Args: + x: (batch_size, seq_len) + z: (batch_size, n_sample, nz) + Returns: + log_p: (batch_size, n_sample). + log_p(x|z) across different x and z + """ + outputs = self.decoder(input_ids=x, past=z, labels=x, label_ignore=self.pad_token_id) + loss_rec = outputs[0] + return -loss_rec + + + + def loss_iw(self, x0, x1, nsamples=50, ns=1): + """ + Args: + x: if the data is constant-length, x is the data tensor with + shape (batch, *). Otherwise x is a tuple that contains + the data tensor and length list + Returns: Tensor1, Tensor2, Tensor3 + Tensor1: total loss [batch] + Tensor2: reconstruction loss shape [batch] + Tensor3: KL loss shape [batch] + """ + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + + ################## + # compute KL + ################## + # pdb.set_trace() + KL = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) + + # mu, logvar = mu.squeeze(0), logvar.squeeze(0) + ll_tmp, rc_tmp = [], [] + for _ in range(int(nsamples / ns)): + + # (batch, nsamples, nz) + z = self.reparameterize(mu, logvar, ns) + # past = self.decoder.linear(z) + past = z + + # [batch, nsamples] + log_prior = self.eval_prior_dist(z) + log_gen = self.eval_cond_ll(x1, past) + log_infer = self.eval_inference_dist(z, (mu, logvar)) + + # pdb.set_trace() + log_gen = log_gen.unsqueeze(0).contiguous().view(z.shape[0],-1) + + + # pdb.set_trace() + rc_tmp.append(log_gen) + ll_tmp.append(log_gen + log_prior - log_infer) + + + + log_prob_iw = log_sum_exp(torch.cat(ll_tmp, dim=-1), dim=-1) - math.log(nsamples) + log_gen_iw = torch.mean(torch.cat(rc_tmp, dim=-1), dim=-1) + + return log_prob_iw, log_gen_iw , KL + + + def nll_iw(self, x0, x1, nsamples, ns=1): + """compute the importance weighting estimate of the log-likelihood + Args: + x0, x1: two different tokenization results of x, where x is the data tensor with shape (batch, *). + nsamples: Int + the number of samples required to estimate marginal data likelihood + Returns: Tensor1 + Tensor1: the estimate of log p(x), shape [batch] + """ + + # compute iw every ns samples to address the memory issue + # nsamples = 500, ns = 100 + # nsamples = 500, ns = 10 + + # TODO: note that x is forwarded twice in self.encoder.sample(x, ns) and self.eval_inference_dist(x, z, param) + #. 
this problem is to be solved in order to speed up + + tmp = [] + for _ in range(int(nsamples / ns)): + # [batch, ns, nz] + + # Chunyuan: + # encoding into bert features + pooled_hidden_fea = self.encoder(x0)[1] + + # param is the parameters required to evaluate q(z|x) + z, param = self.encoder_sample(pooled_hidden_fea, ns) + + # [batch, ns] + log_comp_ll = self.eval_complete_ll(x1, z) + log_infer_ll = self.eval_inference_dist(z, param) + + tmp.append(log_comp_ll - log_infer_ll) + + ll_iw = log_sum_exp(torch.cat(tmp, dim=-1), dim=-1) - math.log(nsamples) + + return ll_iw + + def KL(self, x): + _, KL = self.encode(x, 1) + + return KL + + def eval_prior_dist(self, zrange): + """perform grid search to calculate the true posterior + Args: + zrange: tensor + different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/space + """ + + # (k^2) + return self.prior.log_prob(zrange).sum(dim=-1) + + def eval_complete_ll(self, x, z): + """compute log p(z,x) + Args: + x: Tensor + input with shape [batch, seq_len] + z: Tensor + evaluation points with shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log p(z,x) Tensor with shape [batch, nsamples] + """ + + # [batch, nsamples] + log_prior = self.eval_prior_dist(z) + log_gen = self.eval_cond_ll(x, z) + + return log_prior + log_gen + + + + def eval_cond_ll(self, x, z): + """compute log p(x|z) + """ + x_shape = list(x.size()) + z_shape = list(z.size()) + if len(z_shape) == 3: + x = x.unsqueeze(1).repeat(1, z_shape[1], 1).contiguous().view(x_shape[0]*z_shape[1], x_shape[-1]) + z = z.contiguous().view(x_shape[0]*z_shape[1], z_shape[-1]) + + return self.log_probability(x, z) + + + + def eval_log_model_posterior(self, x, grid_z): + """perform grid search to calculate the true posterior + this function computes p(z|x) + Args: + grid_z: tensor + different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/pace + Returns: Tensor + Tensor: the log posterior distribution log p(z|x) with + shape [batch_size, K^2] + """ + try: + batch_size = x.size(0) + except: + batch_size = x[0].size(0) + + # (batch_size, k^2, nz) + grid_z = grid_z.unsqueeze(0).expand(batch_size, *grid_z.size()).contiguous() + + # (batch_size, k^2) + log_comp = self.eval_complete_ll(x, grid_z) + + # normalize to posterior + log_posterior = log_comp - log_sum_exp(log_comp, dim=1, keepdim=True) + + return log_posterior + + def sample_from_inference(self, x, nsamples=1): + """perform sampling from inference net + Returns: Tensor + Tensor: samples from infernece nets with + shape (batch_size, nsamples, nz) + """ + z, _ = self.encoder.sample(x, nsamples) + + return z + + + def sample_from_posterior(self, x, nsamples): + """perform MH sampling from model posterior + Returns: Tensor + Tensor: samples from model posterior with + shape (batch_size, nsamples, nz) + """ + + # use the samples from inference net as initial points + # for MCMC sampling. 
[batch_size, nsamples, nz] + cur = self.encoder.sample_from_inference(x, 1) + cur_ll = self.eval_complete_ll(x, cur) + total_iter = self.args.mh_burn_in + nsamples * self.args.mh_thin + samples = [] + for iter_ in range(total_iter): + next = torch.normal(mean=cur, + std=cur.new_full(size=cur.size(), fill_value=self.args.mh_std)) + # [batch_size, 1] + next_ll = self.eval_complete_ll(x, next) + ratio = next_ll - cur_ll + + accept_prob = torch.min(ratio.exp(), ratio.new_ones(ratio.size())) + + uniform_t = accept_prob.new_empty(accept_prob.size()).uniform_() + + # [batch_size, 1] + mask = (uniform_t < accept_prob).float() + mask_ = mask.unsqueeze(2) + + cur = mask_ * next + (1 - mask_) * cur + cur_ll = mask * next_ll + (1 - mask) * cur_ll + + if iter_ >= self.args.mh_burn_in and (iter_ - self.args.mh_burn_in) % self.args.mh_thin == 0: + samples.append(cur.unsqueeze(1)) + + return torch.cat(samples, dim=1) + + + def calc_model_posterior_mean(self, x, grid_z): + """compute the mean value of model posterior, i.e. E_{z ~ p(z|x)}[z] + Args: + grid_z: different z points that will be evaluated, with + shape (k^2, nz), where k=(zmax - zmin)/pace + x: [batch, *] + Returns: Tensor1 + Tensor1: the mean value tensor with shape [batch, nz] + """ + + # [batch, K^2] + log_posterior = self.eval_log_model_posterior(x, grid_z) + posterior = log_posterior.exp() + + # [batch, nz] + return torch.mul(posterior.unsqueeze(2), grid_z.unsqueeze(0)).sum(1) + + def calc_infer_mean(self, x): + """ + Returns: Tensor1 + Tensor1: the mean of inference distribution, with shape [batch, nz] + """ + + mean, logvar = self.encoder.forward(x) + + return mean + + + + + def eval_inference_dist(self, z, param): + """this function computes log q(z | x) + Args: + z: tensor + different z points that will be evaluated, with + shape [batch, nsamples, nz] + Returns: Tensor1 + Tensor1: log q(z|x) with shape [batch, nsamples] + """ + + nz = z.size(2) + mu, logvar = param + + # (batch_size, 1, nz) + mu, logvar = mu.unsqueeze(1), logvar.unsqueeze(1) + var = logvar.exp() + + # (batch_size, nsamples, nz) + dev = z - mu + + # (batch_size, nsamples) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + return log_density + + + + def calc_mi(self, test_data_batch, args): + # calc_mi_v3 + import math + from modules.utils import log_sum_exp + + mi = 0 + num_examples = 0 + + mu_batch_list, logvar_batch_list = [], [] + neg_entropy = 0. + for batch_data in test_data_batch: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + (batch_size, nz) + mu, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + x_batch, nz = mu.size() + + #print(x_batch, end=' ') + + num_examples += x_batch + + # E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1) + + neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item() + mu_batch_list += [mu.cpu()] + logvar_batch_list += [logvar.cpu()] + + pdb.set_trace() + + neg_entropy = neg_entropy / num_examples + ##print() + + num_examples = 0 + log_qz = 0. 
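+ # Second pass: for each sampled z, estimate the aggregate posterior log q(z) ~= logsumexp_j log q(z|x_j) - log(batch size),
+ # so that mi = neg_entropy - log_qz approximates E_x E_{q(z|x)}[log q(z|x)] - E_{q(z)}[log q(z)].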
+ for i in range(len(mu_batch_list)): + ############### + # get z_samples + ############### + mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + + # [z_batch, 1, nz] + + z_samples = self.reparameterize(mu, logvar, 1) + + z_samples = z_samples.view(-1, 1, nz) + num_examples += z_samples.size(0) + + ############### + # compute density + ############### + # [1, x_batch, nz] + #mu, logvar = mu_batch_list[i].cuda(), logvar_batch_list[i].cuda() + #indices = list(np.random.choice(np.arange(len(mu_batch_list)), 10)) + [i] + indices = np.arange(len(mu_batch_list)) + mu = torch.cat([mu_batch_list[_] for _ in indices], dim=0).cuda() + logvar = torch.cat([logvar_batch_list[_] for _ in indices], dim=0).cuda() + x_batch, nz = mu.size() + + mu, logvar = mu.unsqueeze(0), logvar.unsqueeze(0) + var = logvar.exp() + + # (z_batch, x_batch, nz) + dev = z_samples - mu + + # (z_batch, x_batch) + log_density = -0.5 * ((dev ** 2) / var).sum(dim=-1) - \ + 0.5 * (nz * math.log(2 * math.pi) + logvar.sum(-1)) + + # log q(z): aggregate posterior + # [z_batch] + log_qz += (log_sum_exp(log_density, dim=1) - math.log(x_batch)).sum(-1) + + log_qz /= num_examples + mi = neg_entropy - log_qz + + return mi + + + + def calc_au(self, eval_dataloader, args, delta=0.01): + """compute the number of active units + """ + cnt = 0 + for batch_data in eval_dataloader: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + mean, logvar = self.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + means_sum = mean.sum(dim=0, keepdim=True) + else: + means_sum = means_sum + mean.sum(dim=0, keepdim=True) + cnt += mean.size(0) + + # (1, nz) + mean_mean = means_sum / cnt + + cnt = 0 + for batch_data in eval_dataloader: + + x0, _, _ = batch_data + x0 = x0.to(args.device) + + # encoding into bert features + bert_fea = self.encoder(x0)[1] + + # (batch_size, nz) + mean, _ = self.encoder.linear(bert_fea).chunk(2, -1) + + if cnt == 0: + var_sum = ((mean - mean_mean) ** 2).sum(dim=0) + else: + var_sum = var_sum + ((mean - mean_mean) ** 2).sum(dim=0) + cnt += mean.size(0) + + # (nz) + au_var = var_sum / (cnt - 1) + + return (au_var >= delta).sum().item(), au_var + diff --git a/Optimus/code/pytorch_transformers/__init__.py b/Optimus/code/pytorch_transformers/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..1a53e07f0843a68264ebaebefc6c22a2a552c73b --- /dev/null +++ b/Optimus/code/pytorch_transformers/__init__.py @@ -0,0 +1,75 @@ +__version__ = "1.2.0" +# Work around to update TensorFlow's absl.logging threshold which alters the +# default Python logging output behavior when present. 
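The `calc_au` routine above counts active units, i.e. latent dimensions whose posterior mean varies across the dataset with variance at least `delta` (0.01 by default). A standalone sketch of the same statistic for a pre-computed matrix of posterior means (toy data, illustrative names):

```python
import torch

def active_units(posterior_means: torch.Tensor, delta: float = 0.01):
    """posterior_means: (num_examples, nz) matrix of q(z|x) means."""
    # Unbiased variance of each latent dimension across the dataset.
    au_var = posterior_means.var(dim=0, unbiased=True)   # (nz,)
    return (au_var >= delta).sum().item(), au_var

# Toy check: dimensions 0-3 vary across examples, the rest are constant.
torch.manual_seed(0)
means = torch.zeros(1000, 32)
means[:, :4] = torch.randn(1000, 4)
n_active, _ = active_units(means)
print(n_active)  # 4
```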
+# see: https://github.com/abseil/abseil-py/issues/99 +# and: https://github.com/tensorflow/tensorflow/issues/26691#issuecomment-500369493 +try: + import absl.logging + absl.logging.set_verbosity('info') + absl.logging.set_stderrthreshold('info') + absl.logging._warn_preinit_stderr = False +except: + pass + +# Tokenizer +from .tokenization_utils import (PreTrainedTokenizer) +from .tokenization_auto import AutoTokenizer +from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer +from .tokenization_openai import OpenAIGPTTokenizer +from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus) +from .tokenization_gpt2 import GPT2Tokenizer +from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE +from .tokenization_xlm import XLMTokenizer +from .tokenization_roberta import RobertaTokenizer +from .tokenization_distilbert import DistilBertTokenizer + +# Configurations +from .configuration_utils import PretrainedConfig +from .configuration_auto import AutoConfig +from .configuration_bert import BertConfig, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_openai import OpenAIGPTConfig, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP +from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP + +# Modeling +from .modeling_utils import (PreTrainedModel, prune_layer, Conv1D) +from .modeling_auto import (AutoModel, AutoModelForSequenceClassification, AutoModelForQuestionAnswering, + AutoModelWithLMHead) + +from .modeling_bert import (BertPreTrainedModel, BertModel, BertForLatentConnector, BertForPreTraining,BertForSequenceClassificationLatentConnector, + BertForMaskedLM, BertForNextSentencePrediction, + BertForSequenceClassification, BertForMultipleChoice, + BertForTokenClassification, BertForQuestionAnswering, + load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_openai import (OpenAIGPTPreTrainedModel, OpenAIGPTModel, + OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel, + load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_transfo_xl import (TransfoXLPreTrainedModel, TransfoXLModel, TransfoXLLMHeadModel, + load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_gpt2 import (GPT2PreTrainedModel, GPT2Model, GPT2ForLatentConnector, + GPT2LMHeadModel, GPT2DoubleHeadsModel, + load_tf_weights_in_gpt2, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_xlnet import (XLNetPreTrainedModel, XLNetModel, XLNetLMHeadModel, + XLNetForSequenceClassification, XLNetForQuestionAnswering, XLNetForMultipleChoice, + load_tf_weights_in_xlnet, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_xlm import (XLMPreTrainedModel , XLMModel, + XLMWithLMHeadModel, XLMForSequenceClassification, + XLMForQuestionAnswering, XLM_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_roberta import (RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification, + RobertaForMultipleChoice, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP) +from .modeling_distilbert import (DistilBertForMaskedLM, DistilBertModel, + DistilBertForSequenceClassification, 
DistilBertForQuestionAnswering, + DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP) + +# Optimization +from .optimization import (AdamW, ConstantLRSchedule, WarmupConstantSchedule, WarmupCosineSchedule, + WarmupCosineWithHardRestartsSchedule, WarmupLinearSchedule) + +# Files and general utilities +from .file_utils import (PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE, + cached_path, add_start_docstrings, add_end_docstrings, + WEIGHTS_NAME, TF_WEIGHTS_NAME, CONFIG_NAME) diff --git a/Optimus/code/pytorch_transformers/__main__.py b/Optimus/code/pytorch_transformers/__main__.py new file mode 100755 index 0000000000000000000000000000000000000000..b047fa74473ee265a6fac31914a77179957ba01f --- /dev/null +++ b/Optimus/code/pytorch_transformers/__main__.py @@ -0,0 +1,128 @@ +# coding: utf8 +def main(): + import sys + if (len(sys.argv) < 4 or len(sys.argv) > 6) or sys.argv[1] not in ["bert", "gpt", "transfo_xl", "gpt2", "xlnet", "xlm"]: + print( + "Should be used as one of: \n" + ">> pytorch_transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \n" + ">> pytorch_transformers gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \n" + ">> pytorch_transformers transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \n" + ">> pytorch_transformers gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \n" + ">> pytorch_transformers xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME] or \n" + ">> pytorch_transformers xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT") + else: + if sys.argv[1] == "bert": + try: + from .convert_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch + except ImportError: + print("pytorch_transformers can only be used from the commandline to convert TensorFlow models in PyTorch, " + "In that case, it requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + + if len(sys.argv) != 5: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`") + else: + PYTORCH_DUMP_OUTPUT = sys.argv.pop() + TF_CONFIG = sys.argv.pop() + TF_CHECKPOINT = sys.argv.pop() + convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT) + elif sys.argv[1] == "gpt": + from .convert_openai_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch + if len(sys.argv) < 4 or len(sys.argv) > 5: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]`") + else: + OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2] + PYTORCH_DUMP_OUTPUT = sys.argv[3] + if len(sys.argv) == 5: + OPENAI_GPT_CONFIG = sys.argv[4] + else: + OPENAI_GPT_CONFIG = "" + convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH, + OPENAI_GPT_CONFIG, + PYTORCH_DUMP_OUTPUT) + elif sys.argv[1] == "transfo_xl": + try: + from .convert_transfo_xl_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch + except ImportError: + print("pytorch_transformers can only be used from the commandline to convert TensorFlow models in PyTorch, " + "In that case, it requires TensorFlow to be installed. 
Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + if len(sys.argv) < 4 or len(sys.argv) > 5: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers transfo_xl TF_CHECKPOINT/TF_DATASET_FILE PYTORCH_DUMP_OUTPUT [TF_CONFIG]`") + else: + if 'ckpt' in sys.argv[2].lower(): + TF_CHECKPOINT = sys.argv[2] + TF_DATASET_FILE = "" + else: + TF_DATASET_FILE = sys.argv[2] + TF_CHECKPOINT = "" + PYTORCH_DUMP_OUTPUT = sys.argv[3] + if len(sys.argv) == 5: + TF_CONFIG = sys.argv[4] + else: + TF_CONFIG = "" + convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE) + elif sys.argv[1] == "gpt2": + try: + from .convert_gpt2_checkpoint_to_pytorch import convert_gpt2_checkpoint_to_pytorch + except ImportError: + print("pytorch_transformers can only be used from the commandline to convert TensorFlow models in PyTorch, " + "In that case, it requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + + if len(sys.argv) < 4 or len(sys.argv) > 5: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [TF_CONFIG]`") + else: + TF_CHECKPOINT = sys.argv[2] + PYTORCH_DUMP_OUTPUT = sys.argv[3] + if len(sys.argv) == 5: + TF_CONFIG = sys.argv[4] + else: + TF_CONFIG = "" + convert_gpt2_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT) + elif sys.argv[1] == "xlnet": + try: + from .convert_xlnet_checkpoint_to_pytorch import convert_xlnet_checkpoint_to_pytorch + except ImportError: + print("pytorch_transformers can only be used from the commandline to convert TensorFlow models in PyTorch, " + "In that case, it requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + + if len(sys.argv) < 5 or len(sys.argv) > 6: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME]`") + else: + TF_CHECKPOINT = sys.argv[2] + TF_CONFIG = sys.argv[3] + PYTORCH_DUMP_OUTPUT = sys.argv[4] + if len(sys.argv) == 6: + FINETUNING_TASK = sys.argv[5] + else: + FINETUNING_TASK = None + + convert_xlnet_checkpoint_to_pytorch(TF_CHECKPOINT, + TF_CONFIG, + PYTORCH_DUMP_OUTPUT, + FINETUNING_TASK) + elif sys.argv[1] == "xlm": + from .convert_xlm_checkpoint_to_pytorch import convert_xlm_checkpoint_to_pytorch + + if len(sys.argv) != 4: + # pylint: disable=line-too-long + print("Should be used as `pytorch_transformers xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT`") + else: + XLM_CHECKPOINT_PATH = sys.argv[2] + PYTORCH_DUMP_OUTPUT = sys.argv[3] + + convert_xlm_checkpoint_to_pytorch(XLM_CHECKPOINT_PATH, PYTORCH_DUMP_OUTPUT) + +if __name__ == '__main__': + main() diff --git a/Optimus/code/pytorch_transformers/configuration_auto.py b/Optimus/code/pytorch_transformers/configuration_auto.py new file mode 100755 index 0000000000000000000000000000000000000000..9e35f85dc748082d7a129bbd42b310ccc4fdec92 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_auto.py @@ -0,0 +1,135 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
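The `__main__` entry point above is a thin dispatcher around the per-model conversion functions, so the same conversion can be invoked either as `python -m pytorch_transformers gpt2 ...` or directly from Python (TensorFlow must be installed either way). A sketch of the direct call for GPT-2; the paths are placeholders, and the empty string mirrors the CLI branch above when no config file is supplied:

```python
# Equivalent to: python -m pytorch_transformers gpt2 /path/to/model.ckpt /path/to/dump
from pytorch_transformers.convert_gpt2_checkpoint_to_pytorch import (
    convert_gpt2_checkpoint_to_pytorch,
)

convert_gpt2_checkpoint_to_pytorch(
    "/path/to/tf_checkpoint/model.ckpt",  # TF_CHECKPOINT (placeholder path)
    "",                                   # TF_CONFIG: "" when no config is given, as in the CLI
    "/path/to/pytorch_dump",              # PYTORCH_DUMP_OUTPUT (placeholder path)
)
```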
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Model class. """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging + +from .configuration_bert import BertConfig +from .configuration_openai import OpenAIGPTConfig +from .configuration_gpt2 import GPT2Config +from .configuration_transfo_xl import TransfoXLConfig +from .configuration_xlnet import XLNetConfig +from .configuration_xlm import XLMConfig +from .configuration_roberta import RobertaConfig +from .configuration_distilbert import DistilBertConfig + +logger = logging.getLogger(__name__) + + +class AutoConfig(object): + r""":class:`~pytorch_transformers.AutoConfig` is a generic configuration class + that will be instantiated as one of the configuration classes of the library + when created with the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` + class method. + + The `from_pretrained()` method take care of returning the correct model class instance + using pattern matching on the `pretrained_model_name_or_path` string. + + The base model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertConfig (DistilBERT model) + - contains `bert`: BertConfig (Bert model) + - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model) + - contains `gpt2`: GPT2Config (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model) + - contains `xlnet`: XLNetConfig (XLNet model) + - contains `xlm`: XLMConfig (XLM model) + - contains `roberta`: RobertaConfig (RoBERTa model) + + This class cannot be instantiated using `__init__()` (throw an error). + """ + def __init__(self): + raise EnvironmentError("AutoConfig is designed to be instantiated " + "using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.") + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): + r""" Instantiate a one of the configuration classes of the library + from a pre-trained model configuration. + + The configuration class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertConfig (DistilBERT model) + - contains `bert`: BertConfig (Bert model) + - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model) + - contains `gpt2`: GPT2Config (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model) + - contains `xlnet`: XLNetConfig (XLNet model) + - contains `xlm`: XLMConfig (XLM model) + - contains `roberta`: RobertaConfig (RoBERTa model) + + Params: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing a configuration file saved using the :func:`~pytorch_transformers.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``. + - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``. 
+ + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + kwargs: (`optional`) dict: key/value pairs with which to update the configuration object after loading. + + - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. + - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + return_unused_kwargs: (`optional`) bool: + + - If False, then this function returns just the final configuration object. + - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored. + + Examples:: + + config = AutoConfig.from_pretrained('bert-base-uncased') # Download configuration from S3 and cache. + config = AutoConfig.from_pretrained('./test/bert_saved_model/') # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')` + config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json') + config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False) + assert config.output_attention == True + config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, + foo=False, return_unused_kwargs=True) + assert config.output_attention == True + assert unused_kwargs == {'foo': False} + + """ + if 'distilbert' in pretrained_model_name_or_path: + return DistilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'roberta' in pretrained_model_name_or_path: + return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'bert' in pretrained_model_name_or_path: + return BertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'openai-gpt' in pretrained_model_name_or_path: + return OpenAIGPTConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'gpt2' in pretrained_model_name_or_path: + return GPT2Config.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'transfo-xl' in pretrained_model_name_or_path: + return TransfoXLConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'xlnet' in pretrained_model_name_or_path: + return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + elif 'xlm' in pretrained_model_name_or_path: + return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) + + raise ValueError("Unrecognized model identifier in {}. 
Should contains one of " + "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', " + "'xlm', 'roberta'".format(pretrained_model_name_or_path)) diff --git a/Optimus/code/pytorch_transformers/configuration_bert.py b/Optimus/code/pytorch_transformers/configuration_bert.py new file mode 100755 index 0000000000000000000000000000000000000000..7fff3e5d058720900fb0388b3c54e31e86045a71 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_bert.py @@ -0,0 +1,113 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" BERT model configuration """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json", + 'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json", +} + + +class BertConfig(PretrainedConfig): + r""" + :class:`~pytorch_transformers.BertConfig` is the configuration class to store the configuration of a + `BertModel`. + + + Arguments: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `BertModel`. + hidden_size: Size of the encoder layers and the pooler layer. 
+ num_hidden_layers: Number of hidden layers in the Transformer encoder. + num_attention_heads: Number of attention heads for each attention layer in + the Transformer encoder. + intermediate_size: The size of the "intermediate" (i.e., feed-forward) + layer in the Transformer encoder. + hidden_act: The non-linear activation function (function or string) in the + encoder and pooler. If string, "gelu", "relu" and "swish" are supported. + hidden_dropout_prob: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob: The dropout ratio for the attention + probabilities. + max_position_embeddings: The maximum sequence length that this model might + ever be used with. Typically set this to something large just in case + (e.g., 512 or 1024 or 2048). + type_vocab_size: The vocabulary size of the `token_type_ids` passed into + `BertModel`. + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + layer_norm_eps: The epsilon used by LayerNorm. + """ + pretrained_config_archive_map = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__(self, + vocab_size_or_config_json_file=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + **kwargs): + super(BertConfig, self).__init__(**kwargs) + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.vocab_size = vocab_size_or_config_json_file + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + else: + raise ValueError("First argument must be either a vocabulary size (int)" + " or the path to a pretrained model config file (str)") diff --git a/Optimus/code/pytorch_transformers/configuration_distilbert.py b/Optimus/code/pytorch_transformers/configuration_distilbert.py new file mode 100755 index 0000000000000000000000000000000000000000..b8929eedec763346fb8da423919bed9d3fd61c85 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_distilbert.py @@ -0,0 +1,89 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
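`AutoConfig.from_pretrained` above dispatches purely by substring matching, which is why `distilbert` and `roberta` are tested before `bert` (both names contain the substring "bert"). `BertConfig` itself accepts either an integer vocabulary size plus keyword hyper-parameters or a path to a JSON config file. A small sketch of both construction paths (the file name and values are illustrative):

```python
import json
import os
import tempfile

from pytorch_transformers import BertConfig

# 1) Construct from explicit hyper-parameters (these are the documented defaults).
config = BertConfig(vocab_size_or_config_json_file=30522, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12)

# 2) Construct from a JSON file: the first argument may also be a path.
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "my_bert_config.json")   # illustrative file name
with open(json_path, "w") as f:
    json.dump({"vocab_size": 30522, "hidden_size": 768,
               "num_hidden_layers": 12, "num_attention_heads": 12}, f)
config_from_file = BertConfig(json_path)

assert config.hidden_size == config_from_file.hidden_size
```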
+# See the License for the specific language governing permissions and +# limitations under the License. +""" DistilBERT model configuration """ +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import sys +import json +import logging +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json", + 'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json" +} + + +class DistilBertConfig(PretrainedConfig): + pretrained_config_archive_map = DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__(self, + vocab_size_or_config_json_file=30522, + max_position_embeddings=512, + sinusoidal_pos_embds=True, + n_layers=6, + n_heads=12, + dim=768, + hidden_dim=4*768, + dropout=0.1, + attention_dropout=0.1, + activation='gelu', + initializer_range=0.02, + tie_weights_=True, + qa_dropout=0.1, + seq_classif_dropout=0.2, + **kwargs): + super(DistilBertConfig, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.vocab_size = vocab_size_or_config_json_file + self.max_position_embeddings = max_position_embeddings + self.sinusoidal_pos_embds = sinusoidal_pos_embds + self.n_layers = n_layers + self.n_heads = n_heads + self.dim = dim + self.hidden_dim = hidden_dim + self.dropout = dropout + self.attention_dropout = attention_dropout + self.activation = activation + self.initializer_range = initializer_range + self.tie_weights_ = tie_weights_ + self.qa_dropout = qa_dropout + self.seq_classif_dropout = seq_classif_dropout + else: + raise ValueError("First argument must be either a vocabulary size (int)" + " or the path to a pretrained model config file (str)") + @property + def hidden_size(self): + return self.dim + + @property + def num_attention_heads(self): + return self.n_heads + + @property + def num_hidden_layers(self): + return self.n_layers diff --git a/Optimus/code/pytorch_transformers/configuration_gpt2.py b/Optimus/code/pytorch_transformers/configuration_gpt2.py new file mode 100755 index 0000000000000000000000000000000000000000..c83d9e82cef82f28b1caa443569bc407217439f7 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_gpt2.py @@ -0,0 +1,143 @@ +# coding=utf-8 +# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
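`DistilBertConfig` above keeps DistilBERT's native field names (`dim`, `n_heads`, `n_layers`) but exposes the BERT-style names as read-only properties, so downstream code can treat both configurations uniformly. A short sketch of that aliasing, using the defaults defined above:

```python
from pytorch_transformers import DistilBertConfig

config = DistilBertConfig()  # defaults above: dim=768, n_heads=12, n_layers=6

assert config.hidden_size == config.dim               # 768
assert config.num_attention_heads == config.n_heads   # 12
assert config.num_hidden_layers == config.n_layers    # 6
```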
+""" OpenAI GPT-2 configuration """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json", + "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json", + "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json"} + +class GPT2Config(PretrainedConfig): + """Configuration class to store the configuration of a `GPT2Model`. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `GPT2Model` or a configuration json file. + n_positions: Number of positional embeddings. + n_ctx: Size of the causal mask (usually same as n_positions). + n_embd: Dimensionality of the embeddings and hidden states. + n_layer: Number of hidden layers in the Transformer encoder. + n_head: Number of attention heads for each attention layer in + the Transformer encoder. + layer_norm_epsilon: epsilon to use in the layer norm layers + resid_pdrop: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + attn_pdrop: The dropout ratio for the attention + probabilities. + embd_pdrop: The dropout ratio for the embeddings. + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + """ + pretrained_config_archive_map = GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__( + self, + vocab_size_or_config_json_file=50257, + n_positions=1024, + n_ctx=1024, + n_embd=768, + n_layer=12, + n_head=12, + resid_pdrop=0.1, + embd_pdrop=0.1, + attn_pdrop=0.1, + layer_norm_epsilon=1e-5, + initializer_range=0.02, + + num_labels=1, + summary_type='cls_index', + summary_use_proj=True, + summary_activation=None, + summary_proj_to_labels=True, + summary_first_dropout=0.1, + **kwargs + ): + """Constructs GPT2Config. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `GPT2Model` or a configuration json file. + n_positions: Number of positional embeddings. + n_ctx: Size of the causal mask (usually same as n_positions). + n_embd: Dimensionality of the embeddings and hidden states. + n_layer: Number of hidden layers in the Transformer encoder. + n_head: Number of attention heads for each attention layer in + the Transformer encoder. + layer_norm_epsilon: epsilon to use in the layer norm layers + resid_pdrop: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + attn_pdrop: The dropout ratio for the attention + probabilities. + embd_pdrop: The dropout ratio for the embeddings. + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. 
+ """ + super(GPT2Config, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.vocab_size = vocab_size_or_config_json_file + self.n_ctx = n_ctx + self.n_positions = n_positions + self.n_embd = n_embd + self.n_layer = n_layer + self.n_head = n_head + self.resid_pdrop = resid_pdrop + self.embd_pdrop = embd_pdrop + self.attn_pdrop = attn_pdrop + self.layer_norm_epsilon = layer_norm_epsilon + self.initializer_range = initializer_range + + self.num_labels = num_labels + self.summary_type = summary_type + self.summary_use_proj = summary_use_proj + self.summary_activation = summary_activation + self.summary_first_dropout = summary_first_dropout + self.summary_proj_to_labels = summary_proj_to_labels + else: + raise ValueError( + "First argument must be either a vocabulary size (int)" + "or the path to a pretrained model config file (str)" + ) + + @property + def max_position_embeddings(self): + return self.n_positions + + @property + def hidden_size(self): + return self.n_embd + + @property + def num_attention_heads(self): + return self.n_head + + @property + def num_hidden_layers(self): + return self.n_layer diff --git a/Optimus/code/pytorch_transformers/configuration_openai.py b/Optimus/code/pytorch_transformers/configuration_openai.py new file mode 100755 index 0000000000000000000000000000000000000000..b27df5689982add1dde24e127a152a40c7c1ac78 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_openai.py @@ -0,0 +1,135 @@ +# coding=utf-8 +# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" OpenAI GPT configuration """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json" +} + +class OpenAIGPTConfig(PretrainedConfig): + """ + Configuration class to store the configuration of a `OpenAIGPTModel`. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `OpenAIGPTModel` or a configuration json file. + n_special: The number of special tokens to learn during fine-tuning ('[SEP]', '[CLF]', ...) + n_positions: Number of positional embeddings. + n_ctx: Size of the causal mask (usually same as n_positions). + n_embd: Dimensionality of the embeddings and hidden states. + n_layer: Number of hidden layers in the Transformer encoder. 
+ n_head: Number of attention heads for each attention layer in + the Transformer encoder. + afn: The non-linear activation function (function or string) in the + encoder and pooler. If string, "gelu", "relu" and "swish" are supported. + resid_pdrop: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + attn_pdrop: The dropout ratio for the attention + probabilities. + embd_pdrop: The dropout ratio for the embeddings. + layer_norm_epsilon: epsilon to use in the layer norm layers + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + predict_special_tokens: should we predict special tokens (when the model has a LM head) + """ + pretrained_config_archive_map = OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__( + self, + vocab_size_or_config_json_file=40478, + n_positions=512, + n_ctx=512, + n_embd=768, + n_layer=12, + n_head=12, + afn="gelu", + resid_pdrop=0.1, + embd_pdrop=0.1, + attn_pdrop=0.1, + layer_norm_epsilon=1e-5, + initializer_range=0.02, + predict_special_tokens=True, + + num_labels=1, + summary_type='cls_index', + summary_use_proj=True, + summary_activation=None, + summary_proj_to_labels=True, + summary_first_dropout=0.1, + **kwargs + ): + """Constructs OpenAIGPTConfig. + """ + super(OpenAIGPTConfig, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.vocab_size = vocab_size_or_config_json_file + self.n_ctx = n_ctx + self.n_positions = n_positions + self.n_embd = n_embd + self.n_layer = n_layer + self.n_head = n_head + self.afn = afn + self.resid_pdrop = resid_pdrop + self.embd_pdrop = embd_pdrop + self.attn_pdrop = attn_pdrop + self.layer_norm_epsilon = layer_norm_epsilon + self.initializer_range = initializer_range + self.predict_special_tokens = predict_special_tokens + + self.num_labels = num_labels + self.summary_type = summary_type + self.summary_use_proj = summary_use_proj + self.summary_activation = summary_activation + self.summary_first_dropout = summary_first_dropout + self.summary_proj_to_labels = summary_proj_to_labels + else: + raise ValueError( + "First argument must be either a vocabulary size (int)" + "or the path to a pretrained model config file (str)" + ) + + @property + def max_position_embeddings(self): + return self.n_positions + + @property + def hidden_size(self): + return self.n_embd + + @property + def num_attention_heads(self): + return self.n_head + + @property + def num_hidden_layers(self): + return self.n_layer diff --git a/Optimus/code/pytorch_transformers/configuration_roberta.py b/Optimus/code/pytorch_transformers/configuration_roberta.py new file mode 100755 index 0000000000000000000000000000000000000000..b92d6a908ba625ba4e507ad5be09e492e1cf0e3e --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_roberta.py @@ -0,0 +1,35 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
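`GPT2Config` and `OpenAIGPTConfig` above follow the same pattern: model-specific fields (`n_embd`, `n_layer`, `n_head`, `n_positions`) plus properties that map them onto the shared `hidden_size` / `num_hidden_layers` / `num_attention_heads` / `max_position_embeddings` interface. A quick sketch with non-default values chosen purely for illustration:

```python
from pytorch_transformers import GPT2Config

# A smaller-than-default GPT-2 configuration; the numbers are illustrative only.
config = GPT2Config(vocab_size_or_config_json_file=50257,
                    n_embd=512, n_layer=6, n_head=8,
                    n_positions=512, n_ctx=512)

assert config.hidden_size == 512
assert config.num_hidden_layers == 6
assert config.num_attention_heads == 8
assert config.max_position_embeddings == 512
```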
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" RoBERTa configuration """ + +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import logging + +from .configuration_bert import BertConfig + +logger = logging.getLogger(__name__) + +ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json", + 'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json", + 'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json", +} + + +class RobertaConfig(BertConfig): + pretrained_config_archive_map = ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP diff --git a/Optimus/code/pytorch_transformers/configuration_transfo_xl.py b/Optimus/code/pytorch_transformers/configuration_transfo_xl.py new file mode 100755 index 0000000000000000000000000000000000000000..2e966ee55cf4583d3f7973ccc865e2c7443021e0 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_transfo_xl.py @@ -0,0 +1,167 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Transformer XL configuration """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json", +} + +class TransfoXLConfig(PretrainedConfig): + """Configuration class to store the configuration of a `TransfoXLModel`. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `TransfoXLModel` or a configuration json file. + cutoffs: cutoffs for the adaptive softmax + d_model: Dimensionality of the model's hidden states. + d_embed: Dimensionality of the embeddings + d_head: Dimensionality of the model's heads. + div_val: divident value for adapative input and softmax + pre_lnorm: apply LayerNorm to the input instead of the output + d_inner: Inner dimension in FF + n_layer: Number of hidden layers in the Transformer encoder. + n_head: Number of attention heads for each attention layer in + the Transformer encoder. 
+ tgt_len: number of tokens to predict + ext_len: length of the extended context + mem_len: length of the retained previous heads + same_length: use the same attn length for all tokens + proj_share_all_but_first: True to share all but first projs, False not to share. + attn_type: attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al. + clamp_len: use the same pos embeddings after clamp_len + sample_softmax: number of samples in sampled softmax + adaptive: use adaptive softmax + tie_weight: tie the word embedding and softmax weights + dropout: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + dropatt: The dropout ratio for the attention probabilities. + untie_r: untie relative position biases + embd_pdrop: The dropout ratio for the embeddings. + init: parameter initializer to use + init_range: parameters initialized by U(-init_range, init_range). + proj_init_std: parameters initialized by N(0, init_std) + init_std: parameters initialized by N(0, init_std) + """ + pretrained_config_archive_map = TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__(self, + vocab_size_or_config_json_file=267735, + cutoffs=[20000, 40000, 200000], + d_model=1024, + d_embed=1024, + n_head=16, + d_head=64, + d_inner=4096, + div_val=4, + pre_lnorm=False, + n_layer=18, + tgt_len=128, + ext_len=0, + mem_len=1600, + clamp_len=1000, + same_length=True, + proj_share_all_but_first=True, + attn_type=0, + sample_softmax=-1, + adaptive=True, + tie_weight=True, + dropout=0.1, + dropatt=0.0, + untie_r=True, + init="normal", + init_range=0.01, + proj_init_std=0.01, + init_std=0.02, + **kwargs): + """Constructs TransfoXLConfig. + """ + super(TransfoXLConfig, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.n_token = vocab_size_or_config_json_file + self.cutoffs = [] + self.cutoffs.extend(cutoffs) + self.tie_weight = tie_weight + if proj_share_all_but_first: + self.tie_projs = [False] + [True] * len(self.cutoffs) + else: + self.tie_projs = [False] + [False] * len(self.cutoffs) + self.d_model = d_model + self.d_embed = d_embed + self.d_head = d_head + self.d_inner = d_inner + self.div_val = div_val + self.pre_lnorm = pre_lnorm + self.n_layer = n_layer + self.n_head = n_head + self.tgt_len = tgt_len + self.ext_len = ext_len + self.mem_len = mem_len + self.same_length = same_length + self.attn_type = attn_type + self.clamp_len = clamp_len + self.sample_softmax = sample_softmax + self.adaptive = adaptive + self.dropout = dropout + self.dropatt = dropatt + self.untie_r = untie_r + self.init = init + self.init_range = init_range + self.proj_init_std = proj_init_std + self.init_std = init_std + else: + raise ValueError("First argument must be either a vocabulary size (int)" + " or the path to a pretrained model config file (str)") + + @property + def max_position_embeddings(self): + return self.tgt_len + self.ext_len + self.mem_len + + @property + def vocab_size(self): + return self.n_token + + @vocab_size.setter + def vocab_size(self, value): + self.n_token = value + + @property + def hidden_size(self): + return self.d_model + + @property + def num_attention_heads(self): + 
return self.n_head + + @property + def num_hidden_layers(self): + return self.n_layer diff --git a/Optimus/code/pytorch_transformers/configuration_utils.py b/Optimus/code/pytorch_transformers/configuration_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..7efc735d4132124cd3d097cc1844f4407551b1db --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_utils.py @@ -0,0 +1,205 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Configuration base class and utilities.""" + +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import copy +import json +import logging +import os +from io import open + +from .file_utils import cached_path, CONFIG_NAME + +logger = logging.getLogger(__name__) + +class PretrainedConfig(object): + r""" Base class for all configuration classes. + Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations. + + Note: + A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights. + It only affects the model's configuration. + + Class attributes (overridden by derived classes): + - ``pretrained_config_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained model configurations as values. + + Parameters: + ``finetuning_task``: string, default `None`. Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint. + ``num_labels``: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens) + ``output_attentions``: boolean, default `False`. Should the model returns attentions weights. + ``output_hidden_states``: string, default `False`. Should the model returns all hidden-states. + ``torchscript``: string, default `False`. Is the model used with Torchscript. + """ + pretrained_config_archive_map = {} + + def __init__(self, **kwargs): + self.finetuning_task = kwargs.pop('finetuning_task', None) + self.num_labels = kwargs.pop('num_labels', 2) + self.output_attentions = kwargs.pop('output_attentions', False) + self.output_hidden_states = kwargs.pop('output_hidden_states', False) + self.torchscript = kwargs.pop('torchscript', False) + self.pruned_heads = kwargs.pop('pruned_heads', {}) + + def save_pretrained(self, save_directory): + """ Save a configuration object to the directory `save_directory`, so that it + can be re-loaded using the :func:`~pytorch_transformers.PretrainedConfig.from_pretrained` class method. 
+ """ + assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved" + + # If we save using the predefined names, we can load using `from_pretrained` + output_config_file = os.path.join(save_directory, CONFIG_NAME) + + self.to_json_file(output_config_file) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): + r""" Instantiate a :class:`~pytorch_transformers.PretrainedConfig` (or a derived class) from a pre-trained model configuration. + + Parameters: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing a configuration file saved using the :func:`~pytorch_transformers.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``. + - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + kwargs: (`optional`) dict: key/value pairs with which to update the configuration object after loading. + + - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. + - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + return_unused_kwargs: (`optional`) bool: + + - If False, then this function returns just the final configuration object. + - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored. + + Examples:: + + # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a + # derived class: BertConfig + config = BertConfig.from_pretrained('bert-base-uncased') # Download configuration from S3 and cache. + config = BertConfig.from_pretrained('./test/saved_model/') # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')` + config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json') + config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False) + assert config.output_attention == True + config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, + foo=False, return_unused_kwargs=True) + assert config.output_attention == True + assert unused_kwargs == {'foo': False} + + """ + cache_dir = kwargs.pop('cache_dir', None) + force_download = kwargs.pop('force_download', False) + proxies = kwargs.pop('proxies', None) + return_unused_kwargs = kwargs.pop('return_unused_kwargs', False) + + if pretrained_model_name_or_path in cls.pretrained_config_archive_map: + config_file = cls.pretrained_config_archive_map[pretrained_model_name_or_path] + elif os.path.isdir(pretrained_model_name_or_path): + config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME) + else: + config_file = pretrained_model_name_or_path + # redirect to the cache, if necessary + try: + resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies) + except EnvironmentError as e: + if pretrained_model_name_or_path in cls.pretrained_config_archive_map: + logger.error( + "Couldn't reach server at '{}' to download pretrained model configuration file.".format( + config_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). " + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(cls.pretrained_config_archive_map.keys()), + config_file)) + raise e + if resolved_config_file == config_file: + logger.info("loading configuration file {}".format(config_file)) + else: + logger.info("loading configuration file {} from cache at {}".format( + config_file, resolved_config_file)) + + # Load config + config = cls.from_json_file(resolved_config_file) + + if hasattr(config, 'pruned_heads'): + config.pruned_heads = dict((int(key), set(value)) for key, value in config.pruned_heads.items()) + + # Update config with kwargs if needed + to_remove = [] + for key, value in kwargs.items(): + if hasattr(config, key): + setattr(config, key, value) + to_remove.append(key) + for key in to_remove: + kwargs.pop(key, None) + + logger.info("Model config %s", config) + if return_unused_kwargs: + return config, kwargs + else: + return config + + @classmethod + def from_dict(cls, json_object): + """Constructs a `Config` from a Python dictionary of parameters.""" + config = cls(vocab_size_or_config_json_file=-1) + for key, value in json_object.items(): + config.__dict__[key] = value + return config + + @classmethod + def from_json_file(cls, json_file): + """Constructs a `BertConfig` from a json file of parameters.""" + with open(json_file, "r", encoding='utf-8') as reader: + text = reader.read() + return cls.from_dict(json.loads(text)) + + def __eq__(self, other): + return self.__dict__ == other.__dict__ + + def __repr__(self): + return str(self.to_json_string()) + + def to_dict(self): + """Serializes this instance to a Python dictionary.""" + output = copy.deepcopy(self.__dict__) + return output + + def to_json_string(self): + """Serializes this instance to a JSON string.""" + return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" + + def to_json_file(self, json_file_path): + """ Save this instance to a json file.""" + with 
open(json_file_path, "w", encoding='utf-8') as writer: + writer.write(self.to_json_string()) diff --git a/Optimus/code/pytorch_transformers/configuration_xlm.py b/Optimus/code/pytorch_transformers/configuration_xlm.py new file mode 100755 index 0000000000000000000000000000000000000000..ab251c8939e0ec4c6066be8437b6acdac1caeb57 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_xlm.py @@ -0,0 +1,184 @@ +# coding=utf-8 +# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" XLM configuration """ +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +XLM_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'xlm-mlm-en-2048': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-config.json", + 'xlm-mlm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-config.json", + 'xlm-mlm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-config.json", + 'xlm-mlm-enro-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-config.json", + 'xlm-mlm-tlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-config.json", + 'xlm-mlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-config.json", + 'xlm-clm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-config.json", + 'xlm-clm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-config.json", + 'xlm-mlm-17-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-config.json", + 'xlm-mlm-100-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-config.json", +} + + +class XLMConfig(PretrainedConfig): + """Configuration class to store the configuration of a `XLMModel`. + + Args: + vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `XLMModel`. + d_model: Size of the encoder layers and the pooler layer. + n_layer: Number of hidden layers in the Transformer encoder. + n_head: Number of attention heads for each attention layer in + the Transformer encoder. + d_inner: The size of the "intermediate" (i.e., feed-forward) + layer in the Transformer encoder. + ff_activation: The non-linear activation function (function or string) in the + encoder and pooler. If string, "gelu", "relu" and "swish" are supported. + untie_r: untie relative position biases + attn_type: 'bi' for XLM, 'uni' for Transformer-XL + + dropout: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + dropatt: The dropout ratio for the attention + probabilities. + max_position_embeddings: The maximum sequence length that this model might + ever be used with. 
Typically set this to something large just in case + (e.g., 512 or 1024 or 2048). + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + layer_norm_eps: The epsilon used by LayerNorm. + + dropout: float, dropout rate. + dropatt: float, dropout rate on attention probabilities. + init: str, the initialization scheme, either "normal" or "uniform". + init_range: float, initialize the parameters with a uniform distribution + in [-init_range, init_range]. Only effective when init="uniform". + init_std: float, initialize the parameters with a normal distribution + with mean 0 and stddev init_std. Only effective when init="normal". + mem_len: int, the number of tokens to cache. + reuse_len: int, the number of tokens in the currect batch to be cached + and reused in the future. + bi_data: bool, whether to use bidirectional input pipeline. + Usually set to True during pretraining and False during finetuning. + clamp_len: int, clamp all relative distances larger than clamp_len. + -1 means no clamping. + same_length: bool, whether to use the same attention length for each token. + """ + pretrained_config_archive_map = XLM_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__(self, + vocab_size_or_config_json_file=30145, + emb_dim=2048, + n_layers=12, + n_heads=16, + dropout=0.1, + attention_dropout=0.1, + gelu_activation=True, + sinusoidal_embeddings=False, + causal=False, + asm=False, + n_langs=1, + use_lang_emb=True, + max_position_embeddings=512, + embed_init_std=2048 ** -0.5, + layer_norm_eps=1e-12, + init_std=0.02, + bos_index=0, + eos_index=1, + pad_index=2, + unk_index=3, + mask_index=5, + is_encoder=True, + + finetuning_task=None, + num_labels=2, + summary_type='first', + summary_use_proj=True, + summary_activation=None, + summary_proj_to_labels=True, + summary_first_dropout=0.1, + start_n_top=5, + end_n_top=5, + **kwargs): + """Constructs XLMConfig. 
+ """ + super(XLMConfig, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.n_words = vocab_size_or_config_json_file + self.emb_dim = emb_dim + self.n_layers = n_layers + self.n_heads = n_heads + self.dropout = dropout + self.attention_dropout = attention_dropout + self.gelu_activation = gelu_activation + self.sinusoidal_embeddings = sinusoidal_embeddings + self.causal = causal + self.asm = asm + self.n_langs = n_langs + self.use_lang_emb = use_lang_emb + self.layer_norm_eps = layer_norm_eps + self.bos_index = bos_index + self.eos_index = eos_index + self.pad_index = pad_index + self.unk_index = unk_index + self.mask_index = mask_index + self.is_encoder = is_encoder + self.max_position_embeddings = max_position_embeddings + self.embed_init_std = embed_init_std + self.init_std = init_std + self.finetuning_task = finetuning_task + self.num_labels = num_labels + self.summary_type = summary_type + self.summary_use_proj = summary_use_proj + self.summary_activation = summary_activation + self.summary_proj_to_labels = summary_proj_to_labels + self.summary_first_dropout = summary_first_dropout + self.start_n_top = start_n_top + self.end_n_top = end_n_top + else: + raise ValueError("First argument must be either a vocabulary size (int)" + " or the path to a pretrained model config file (str)") + + @property + def vocab_size(self): + return self.n_words + + @vocab_size.setter + def vocab_size(self, value): + self.n_words = value + + @property + def hidden_size(self): + return self.emb_dim + + @property + def num_attention_heads(self): + return self.n_heads + + @property + def num_hidden_layers(self): + return self.n_layers diff --git a/Optimus/code/pytorch_transformers/configuration_xlnet.py b/Optimus/code/pytorch_transformers/configuration_xlnet.py new file mode 100755 index 0000000000000000000000000000000000000000..204d44aa7281ca8e0d3c0c4ef73a92e9bb89da56 --- /dev/null +++ b/Optimus/code/pytorch_transformers/configuration_xlnet.py @@ -0,0 +1,172 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" XLNet configuration """ +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import sys +from io import open + +from .configuration_utils import PretrainedConfig + +logger = logging.getLogger(__name__) + +XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = { + 'xlnet-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json", + 'xlnet-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-config.json", +} + + +class XLNetConfig(PretrainedConfig): + """Configuration class to store the configuration of a ``XLNetModel``. + + Args: + vocab_size_or_config_json_file: Vocabulary size of ``inputs_ids`` in ``XLNetModel``. + d_model: Size of the encoder layers and the pooler layer. + n_layer: Number of hidden layers in the Transformer encoder. + n_head: Number of attention heads for each attention layer in + the Transformer encoder. + d_inner: The size of the "intermediate" (i.e., feed-forward) + layer in the Transformer encoder. + ff_activation: The non-linear activation function (function or string) in the + encoder and pooler. If string, "gelu", "relu" and "swish" are supported. + untie_r: untie relative position biases + attn_type: 'bi' for XLNet, 'uni' for Transformer-XL + + dropout: The dropout probabilitiy for all fully connected + layers in the embeddings, encoder, and pooler. + dropatt: The dropout ratio for the attention + probabilities. + initializer_range: The sttdev of the truncated_normal_initializer for + initializing all weight matrices. + layer_norm_eps: The epsilon used by LayerNorm. + + dropout: float, dropout rate. + dropatt: float, dropout rate on attention probabilities. + init: str, the initialization scheme, either "normal" or "uniform". + init_range: float, initialize the parameters with a uniform distribution + in [-init_range, init_range]. Only effective when init="uniform". + init_std: float, initialize the parameters with a normal distribution + with mean 0 and stddev init_std. Only effective when init="normal". + mem_len: int, the number of tokens to cache. + reuse_len: int, the number of tokens in the currect batch to be cached + and reused in the future. + bi_data: bool, whether to use bidirectional input pipeline. + Usually set to True during pretraining and False during finetuning. + clamp_len: int, clamp all relative distances larger than clamp_len. + -1 means no clamping. + same_length: bool, whether to use the same attention length for each token. + finetuning_task: name of the glue task on which the model was fine-tuned if any + """ + pretrained_config_archive_map = XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP + + def __init__(self, + vocab_size_or_config_json_file=32000, + d_model=1024, + n_layer=24, + n_head=16, + d_inner=4096, + ff_activation="gelu", + untie_r=True, + attn_type="bi", + + initializer_range=0.02, + layer_norm_eps=1e-12, + + dropout=0.1, + mem_len=None, + reuse_len=None, + bi_data=False, + clamp_len=-1, + same_length=False, + + finetuning_task=None, + num_labels=2, + summary_type='last', + summary_use_proj=True, + summary_activation='tanh', + summary_last_dropout=0.1, + start_n_top=5, + end_n_top=5, + **kwargs): + """Constructs XLNetConfig. 
+ """ + super(XLNetConfig, self).__init__(**kwargs) + + if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2 + and isinstance(vocab_size_or_config_json_file, unicode)): + with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader: + json_config = json.loads(reader.read()) + for key, value in json_config.items(): + self.__dict__[key] = value + elif isinstance(vocab_size_or_config_json_file, int): + self.n_token = vocab_size_or_config_json_file + self.d_model = d_model + self.n_layer = n_layer + self.n_head = n_head + assert d_model % n_head == 0 + self.d_head = d_model // n_head + self.ff_activation = ff_activation + self.d_inner = d_inner + self.untie_r = untie_r + self.attn_type = attn_type + + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + + self.dropout = dropout + self.mem_len = mem_len + self.reuse_len = reuse_len + self.bi_data = bi_data + self.clamp_len = clamp_len + self.same_length = same_length + + self.finetuning_task = finetuning_task + self.num_labels = num_labels + self.summary_type = summary_type + self.summary_use_proj = summary_use_proj + self.summary_activation = summary_activation + self.summary_last_dropout = summary_last_dropout + self.start_n_top = start_n_top + self.end_n_top = end_n_top + else: + raise ValueError("First argument must be either a vocabulary size (int)" + " or the path to a pretrained model config file (str)") + + @property + def max_position_embeddings(self): + return -1 + + @property + def vocab_size(self): + return self.n_token + + @vocab_size.setter + def vocab_size(self, value): + self.n_token = value + + @property + def hidden_size(self): + return self.d_model + + @property + def num_attention_heads(self): + return self.n_head + + @property + def num_hidden_layers(self): + return self.n_layer diff --git a/Optimus/code/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..eb5b3009b4ce6312bf0fac8a91b55e9550a95469 --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py @@ -0,0 +1,75 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Convert OpenAI GPT checkpoint.""" + +from __future__ import absolute_import, division, print_function + +import argparse +from io import open + +import torch + +from pytorch_transformers import (CONFIG_NAME, WEIGHTS_NAME, + GPT2Config, + GPT2Model, + load_tf_weights_in_gpt2) + +import logging +logging.basicConfig(level=logging.INFO) + + +def convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, pytorch_dump_folder_path): + # Construct model + if gpt2_config_file == "": + config = GPT2Config() + else: + config = GPT2Config.from_json_file(gpt2_config_file) + model = GPT2Model(config) + + # Load weights from numpy + load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path) + + # Save pytorch-model + pytorch_weights_dump_path = pytorch_dump_folder_path + '/' + WEIGHTS_NAME + pytorch_config_dump_path = pytorch_dump_folder_path + '/' + CONFIG_NAME + print("Save PyTorch model to {}".format(pytorch_weights_dump_path)) + torch.save(model.state_dict(), pytorch_weights_dump_path) + print("Save configuration file to {}".format(pytorch_config_dump_path)) + with open(pytorch_config_dump_path, "w", encoding="utf-8") as f: + f.write(config.to_json_string()) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--gpt2_checkpoint_path", + default = None, + type = str, + required = True, + help = "Path to the TensorFlow checkpoint path.") + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the output PyTorch model.") + parser.add_argument("--gpt2_config_file", + default = "", + type = str, + help = "An optional config json file corresponding to the pre-trained OpenAI model. \n" + "This specifies the model architecture.") + args = parser.parse_args() + convert_gpt2_checkpoint_to_pytorch(args.gpt2_checkpoint_path, + args.gpt2_config_file, + args.pytorch_dump_folder_path) diff --git a/Optimus/code/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..5eecdd9648c2ffcb6aebb5e58797cd64267c45a6 --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py @@ -0,0 +1,75 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Convert OpenAI GPT checkpoint.""" + +from __future__ import absolute_import, division, print_function + +import argparse +from io import open + +import torch + +from pytorch_transformers import (CONFIG_NAME, WEIGHTS_NAME, + OpenAIGPTConfig, + OpenAIGPTModel, + load_tf_weights_in_openai_gpt) + +import logging +logging.basicConfig(level=logging.INFO) + + +def convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path): + # Construct model + if openai_config_file == "": + config = OpenAIGPTConfig() + else: + config = OpenAIGPTConfig.from_json_file(openai_config_file) + model = OpenAIGPTModel(config) + + # Load weights from numpy + load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path) + + # Save pytorch-model + pytorch_weights_dump_path = pytorch_dump_folder_path + '/' + WEIGHTS_NAME + pytorch_config_dump_path = pytorch_dump_folder_path + '/' + CONFIG_NAME + print("Save PyTorch model to {}".format(pytorch_weights_dump_path)) + torch.save(model.state_dict(), pytorch_weights_dump_path) + print("Save configuration file to {}".format(pytorch_config_dump_path)) + with open(pytorch_config_dump_path, "w", encoding="utf-8") as f: + f.write(config.to_json_string()) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--openai_checkpoint_folder_path", + default = None, + type = str, + required = True, + help = "Path to the TensorFlow checkpoint path.") + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the output PyTorch model.") + parser.add_argument("--openai_config_file", + default = "", + type = str, + help = "An optional config json file corresponding to the pre-trained OpenAI model. \n" + "This specifies the model architecture.") + args = parser.parse_args() + convert_openai_checkpoint_to_pytorch(args.openai_checkpoint_folder_path, + args.openai_config_file, + args.pytorch_dump_folder_path) diff --git a/Optimus/code/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py b/Optimus/code/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py new file mode 100755 index 0000000000000000000000000000000000000000..15fd6bf5acfc0caae4d92b0ddfdcd514c5fdf481 --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py @@ -0,0 +1,130 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Convert Huggingface Pytorch checkpoint to Tensorflow checkpoint.""" + +import os +import argparse +import torch +import numpy as np +import tensorflow as tf +from pytorch_transformers import BertModel + + +def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:str): + + """ + :param model:BertModel Pytorch model instance to be converted + :param ckpt_dir: Tensorflow model directory + :param model_name: model name + :return: + + Currently supported HF models: + Y BertModel + N BertForMaskedLM + N BertForPreTraining + N BertForMultipleChoice + N BertForNextSentencePrediction + N BertForSequenceClassification + N BertForQuestionAnswering + """ + + tensors_to_transpose = ( + "dense.weight", + "attention.self.query", + "attention.self.key", + "attention.self.value" + ) + + var_map = ( + ('layer.', 'layer_'), + ('word_embeddings.weight', 'word_embeddings'), + ('position_embeddings.weight', 'position_embeddings'), + ('token_type_embeddings.weight', 'token_type_embeddings'), + ('.', '/'), + ('LayerNorm/weight', 'LayerNorm/gamma'), + ('LayerNorm/bias', 'LayerNorm/beta'), + ('weight', 'kernel') + ) + + if not os.path.isdir(ckpt_dir): + os.makedirs(ckpt_dir) + + state_dict = model.state_dict() + + def to_tf_var_name(name:str): + for patt, repl in iter(var_map): + name = name.replace(patt, repl) + return 'bert/{}'.format(name) + + def create_tf_var(tensor:np.ndarray, name:str, session:tf.Session): + tf_dtype = tf.dtypes.as_dtype(tensor.dtype) + tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer()) + session.run(tf.variables_initializer([tf_var])) + session.run(tf_var) + return tf_var + + tf.reset_default_graph() + with tf.Session() as session: + for var_name in state_dict: + tf_name = to_tf_var_name(var_name) + torch_tensor = state_dict[var_name].numpy() + if any([x in var_name for x in tensors_to_transpose]): + torch_tensor = torch_tensor.T + tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session) + tf.keras.backend.set_value(tf_var, torch_tensor) + tf_weight = session.run(tf_var) + print("Successfully created {}: {}".format(tf_name, np.allclose(tf_weight, torch_tensor))) + + saver = tf.train.Saver(tf.trainable_variables()) + saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt")) + + +def main(raw_args=None): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name", + type=str, + required=True, + help="model name e.g. 
bert-base-uncased") + parser.add_argument("--cache_dir", + type=str, + default=None, + required=False, + help="Directory containing pytorch model") + parser.add_argument("--pytorch_model_path", + type=str, + required=True, + help="/path/to/.bin") + parser.add_argument("--tf_cache_dir", + type=str, + required=True, + help="Directory in which to save tensorflow model") + args = parser.parse_args(raw_args) + + model = BertModel.from_pretrained( + pretrained_model_name_or_path=args.model_name, + state_dict=torch.load(args.pytorch_model_path), + cache_dir=args.cache_dir + ) + + convert_pytorch_checkpoint_to_tf( + model=model, + ckpt_dir=args.tf_cache_dir, + model_name=args.model_name + ) + + +if __name__ == "__main__": + main() diff --git a/Optimus/code/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..9f74254daa8854ae9986084f054d7d5afa05d2dc --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py @@ -0,0 +1,180 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert RoBERTa checkpoint.""" + +from __future__ import absolute_import, division, print_function + +import argparse +import logging +import numpy as np +import torch + +from fairseq.models.roberta import RobertaModel as FairseqRobertaModel +from fairseq.modules import TransformerSentenceEncoderLayer +from pytorch_transformers import (BertConfig, BertEncoder, + BertIntermediate, BertLayer, + BertModel, BertOutput, + BertSelfAttention, + BertSelfOutput) +from pytorch_transformers import (RobertaEmbeddings, + RobertaForMaskedLM, + RobertaForSequenceClassification, + RobertaModel) + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +SAMPLE_TEXT = 'Hello world! cécé herlolip' + + +def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path, classification_head): + """ + Copy/paste/tweak roberta's weights to our BERT structure. + """ + roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path) + roberta.eval() # disable dropout + config = BertConfig( + vocab_size_or_config_json_file=50265, + hidden_size=roberta.args.encoder_embed_dim, + num_hidden_layers=roberta.args.encoder_layers, + num_attention_heads=roberta.args.encoder_attention_heads, + intermediate_size=roberta.args.encoder_ffn_embed_dim, + max_position_embeddings=514, + type_vocab_size=1, + layer_norm_eps=1e-5, # PyTorch default used in fairseq + ) + if classification_head: + config.num_labels = roberta.args.num_classes + print("Our BERT config:", config) + + model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config) + model.eval() + + # Now let's copy all the weights. 
+ # Embeddings + roberta_sent_encoder = roberta.model.decoder.sentence_encoder + model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight + model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight + model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(model.roberta.embeddings.token_type_embeddings.weight) # just zero them out b/c RoBERTa doesn't use them. + model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight + model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias + + for i in range(config.num_hidden_layers): + # Encoder: start of layer + layer: BertLayer = model.roberta.encoder.layer[i] + roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i] + + ### self attention + self_attn: BertSelfAttention = layer.attention.self + assert( + roberta_layer.self_attn.in_proj_weight.shape == torch.Size((3 * config.hidden_size, config.hidden_size)) + ) + # we use three distinct linear layers so we split the source layer here. + self_attn.query.weight.data = roberta_layer.self_attn.in_proj_weight[:config.hidden_size, :] + self_attn.query.bias.data = roberta_layer.self_attn.in_proj_bias[:config.hidden_size] + self_attn.key.weight.data = roberta_layer.self_attn.in_proj_weight[config.hidden_size:2*config.hidden_size, :] + self_attn.key.bias.data = roberta_layer.self_attn.in_proj_bias[config.hidden_size:2*config.hidden_size] + self_attn.value.weight.data = roberta_layer.self_attn.in_proj_weight[2*config.hidden_size:, :] + self_attn.value.bias.data = roberta_layer.self_attn.in_proj_bias[2*config.hidden_size:] + + ### self-attention output + self_output: BertSelfOutput = layer.attention.output + assert( + self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape + ) + self_output.dense.weight = roberta_layer.self_attn.out_proj.weight + self_output.dense.bias = roberta_layer.self_attn.out_proj.bias + self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight + self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias + + ### intermediate + intermediate: BertIntermediate = layer.intermediate + assert( + intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape + ) + intermediate.dense.weight = roberta_layer.fc1.weight + intermediate.dense.bias = roberta_layer.fc1.bias + + ### output + bert_output: BertOutput = layer.output + assert( + bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape + ) + bert_output.dense.weight = roberta_layer.fc2.weight + bert_output.dense.bias = roberta_layer.fc2.bias + bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight + bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias + #### end of layer + + if classification_head: + model.classifier.dense.weight = roberta.model.classification_heads['mnli'].dense.weight + model.classifier.dense.bias = roberta.model.classification_heads['mnli'].dense.bias + model.classifier.out_proj.weight = roberta.model.classification_heads['mnli'].out_proj.weight + model.classifier.out_proj.bias = roberta.model.classification_heads['mnli'].out_proj.bias + else: + # LM Head + model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight + model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias + model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight + model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias + 
model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight + model.lm_head.bias = roberta.model.decoder.lm_head.bias + + # Let's check that we get the same results. + input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0) # batch of size 1 + + our_output = model(input_ids)[0] + if classification_head: + their_output = roberta.model.classification_heads['mnli'](roberta.extract_features(input_ids)) + else: + their_output = roberta.model(input_ids)[0] + print(our_output.shape, their_output.shape) + max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item() + print(f"max_absolute_diff = {max_absolute_diff}") # ~ 1e-7 + success = torch.allclose(our_output, their_output, atol=1e-3) + print( + "Do both models output the same tensors?", + "🔥" if success else "💩" + ) + if not success: + raise Exception("Something went wRoNg") + + print(f"Saving model to {pytorch_dump_folder_path}") + model.save_pretrained(pytorch_dump_folder_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--roberta_checkpoint_path", + default = None, + type = str, + required = True, + help = "Path the official PyTorch dump.") + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the output PyTorch model.") + parser.add_argument("--classification_head", + action = "store_true", + help = "Whether to convert a final classification head.") + args = parser.parse_args() + convert_roberta_checkpoint_to_pytorch( + args.roberta_checkpoint_path, + args.pytorch_dump_folder_path, + args.classification_head + ) + diff --git a/Optimus/code/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..d382d3588e2c475401985fe499da62673d6d4115 --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py @@ -0,0 +1,65 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Convert BERT checkpoint.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import torch + +from pytorch_transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert + +import logging +logging.basicConfig(level=logging.INFO) + +def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path): + # Initialise PyTorch model + config = BertConfig.from_json_file(bert_config_file) + print("Building PyTorch model from configuration: {}".format(str(config))) + model = BertForPreTraining(config) + + # Load weights from tf checkpoint + load_tf_weights_in_bert(model, config, tf_checkpoint_path) + + # Save pytorch-model + print("Save PyTorch model to {}".format(pytorch_dump_path)) + torch.save(model.state_dict(), pytorch_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--tf_checkpoint_path", + default = None, + type = str, + required = True, + help = "Path to the TensorFlow checkpoint path.") + parser.add_argument("--bert_config_file", + default = None, + type = str, + required = True, + help = "The config json file corresponding to the pre-trained BERT model. \n" + "This specifies the model architecture.") + parser.add_argument("--pytorch_dump_path", + default = None, + type = str, + required = True, + help = "Path to the output PyTorch model.") + args = parser.parse_args() + convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, + args.bert_config_file, + args.pytorch_dump_path) diff --git a/Optimus/code/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..b310b73453c9c6d927dd9bd97d0062c23f548e3b --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py @@ -0,0 +1,117 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert Transformer XL checkpoint and datasets.""" + +from __future__ import absolute_import, division, print_function + +import argparse +import os +import sys +from io import open + +import torch + +import pytorch_transformers.tokenization_transfo_xl as data_utils + +from pytorch_transformers import CONFIG_NAME, WEIGHTS_NAME +from pytorch_transformers import (TransfoXLConfig, TransfoXLLMHeadModel, + load_tf_weights_in_transfo_xl) +from pytorch_transformers.tokenization_transfo_xl import (CORPUS_NAME, VOCAB_FILES_NAMES) + +if sys.version_info[0] == 2: + import cPickle as pickle +else: + import pickle + +import logging +logging.basicConfig(level=logging.INFO) + +# We do this to be able to load python 2 datasets pickles +# See e.g. 
https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918 +data_utils.Vocab = data_utils.TransfoXLTokenizer +data_utils.Corpus = data_utils.TransfoXLCorpus +sys.modules['data_utils'] = data_utils +sys.modules['vocabulary'] = data_utils + +def convert_transfo_xl_checkpoint_to_pytorch(tf_checkpoint_path, + transfo_xl_config_file, + pytorch_dump_folder_path, + transfo_xl_dataset_file): + if transfo_xl_dataset_file: + # Convert a pre-processed corpus (see original TensorFlow repo) + with open(transfo_xl_dataset_file, "rb") as fp: + corpus = pickle.load(fp, encoding="latin1") + # Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term) + pytorch_vocab_dump_path = pytorch_dump_folder_path + '/' + VOCAB_FILES_NAMES['pretrained_vocab_file'] + print("Save vocabulary to {}".format(pytorch_vocab_dump_path)) + corpus_vocab_dict = corpus.vocab.__dict__ + torch.save(corpus_vocab_dict, pytorch_vocab_dump_path) + + corpus_dict_no_vocab = corpus.__dict__ + corpus_dict_no_vocab.pop('vocab', None) + pytorch_dataset_dump_path = pytorch_dump_folder_path + '/' + CORPUS_NAME + print("Save dataset to {}".format(pytorch_dataset_dump_path)) + torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path) + + if tf_checkpoint_path: + # Convert a pre-trained TensorFlow model + config_path = os.path.abspath(transfo_xl_config_file) + tf_path = os.path.abspath(tf_checkpoint_path) + + print("Converting Transformer XL checkpoint from {} with config at {}".format(tf_path, config_path)) + # Initialise PyTorch model + if transfo_xl_config_file == "": + config = TransfoXLConfig() + else: + config = TransfoXLConfig.from_json_file(transfo_xl_config_file) + print("Building PyTorch model from configuration: {}".format(str(config))) + model = TransfoXLLMHeadModel(config) + + model = load_tf_weights_in_transfo_xl(model, config, tf_path) + # Save pytorch-model + pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME) + pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME) + print("Save PyTorch model to {}".format(os.path.abspath(pytorch_weights_dump_path))) + torch.save(model.state_dict(), pytorch_weights_dump_path) + print("Save configuration file to {}".format(os.path.abspath(pytorch_config_dump_path))) + with open(pytorch_config_dump_path, "w", encoding="utf-8") as f: + f.write(config.to_json_string()) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the folder to store the PyTorch model or dataset/vocab.") + parser.add_argument("--tf_checkpoint_path", + default = "", + type = str, + help = "An optional path to a TensorFlow checkpoint path to be converted.") + parser.add_argument("--transfo_xl_config_file", + default = "", + type = str, + help = "An optional config json file corresponding to the pre-trained BERT model. 
\n" + "This specifies the model architecture.") + parser.add_argument("--transfo_xl_dataset_file", + default = "", + type = str, + help = "An optional dataset file to be converted in a vocabulary.") + args = parser.parse_args() + convert_transfo_xl_checkpoint_to_pytorch(args.tf_checkpoint_path, + args.transfo_xl_config_file, + args.pytorch_dump_folder_path, + args.transfo_xl_dataset_file) diff --git a/Optimus/code/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..d6a3cd89e7efc2fcf0841117150b1abe81cec7e4 --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py @@ -0,0 +1,75 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert OpenAI GPT checkpoint.""" + +from __future__ import absolute_import, division, print_function + +import argparse +import json +from io import open + +import torch +import numpy + +from pytorch_transformers import CONFIG_NAME, WEIGHTS_NAME +from pytorch_transformers.tokenization_xlm import VOCAB_FILES_NAMES + +import logging +logging.basicConfig(level=logging.INFO) + +def convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_path): + # Load checkpoint + chkpt = torch.load(xlm_checkpoint_path, map_location='cpu') + + model = chkpt['model'] + + config = chkpt['params'] + config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray))) + + vocab = chkpt['dico_word2id'] + vocab = dict((s + '' if s.find('@@') == -1 and i > 13 else s.replace('@@', ''), i) for s, i in vocab.items()) + + # Save pytorch-model + pytorch_weights_dump_path = pytorch_dump_folder_path + '/' + WEIGHTS_NAME + pytorch_config_dump_path = pytorch_dump_folder_path + '/' + CONFIG_NAME + pytorch_vocab_dump_path = pytorch_dump_folder_path + '/' + VOCAB_FILES_NAMES['vocab_file'] + + print("Save PyTorch model to {}".format(pytorch_weights_dump_path)) + torch.save(model, pytorch_weights_dump_path) + + print("Save configuration file to {}".format(pytorch_config_dump_path)) + with open(pytorch_config_dump_path, "w", encoding="utf-8") as f: + f.write(json.dumps(config, indent=2) + "\n") + + print("Save vocab file to {}".format(pytorch_config_dump_path)) + with open(pytorch_vocab_dump_path, "w", encoding="utf-8") as f: + f.write(json.dumps(vocab, indent=2) + "\n") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--xlm_checkpoint_path", + default = None, + type = str, + required = True, + help = "Path the official PyTorch dump.") + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the output PyTorch model.") + args = parser.parse_args() + convert_xlm_checkpoint_to_pytorch(args.xlm_checkpoint_path, args.pytorch_dump_folder_path) diff --git 
a/Optimus/code/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py b/Optimus/code/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py new file mode 100755 index 0000000000000000000000000000000000000000..a36fa514b59b7f104cb694f2c2ec7a5d7d8f1f7a --- /dev/null +++ b/Optimus/code/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py @@ -0,0 +1,104 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert BERT checkpoint.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import argparse +import torch + +from pytorch_transformers import (CONFIG_NAME, WEIGHTS_NAME, + XLNetConfig, + XLNetLMHeadModel, XLNetForQuestionAnswering, + XLNetForSequenceClassification, + load_tf_weights_in_xlnet) + +GLUE_TASKS_NUM_LABELS = { + "cola": 2, + "mnli": 3, + "mrpc": 2, + "sst-2": 2, + "sts-b": 1, + "qqp": 2, + "qnli": 2, + "rte": 2, + "wnli": 2, +} + +import logging +logging.basicConfig(level=logging.INFO) + +def convert_xlnet_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_folder_path, finetuning_task=None): + # Initialise PyTorch model + config = XLNetConfig.from_json_file(bert_config_file) + + finetuning_task = finetuning_task.lower() if finetuning_task is not None else "" + if finetuning_task in GLUE_TASKS_NUM_LABELS: + print("Building PyTorch XLNetForSequenceClassification model from configuration: {}".format(str(config))) + config.finetuning_task = finetuning_task + config.num_labels = GLUE_TASKS_NUM_LABELS[finetuning_task] + model = XLNetForSequenceClassification(config) + elif 'squad' in finetuning_task: + config.finetuning_task = finetuning_task + model = XLNetForQuestionAnswering(config) + else: + model = XLNetLMHeadModel(config) + + # Load weights from tf checkpoint + load_tf_weights_in_xlnet(model, config, tf_checkpoint_path) + + # Save pytorch-model + pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME) + pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME) + print("Save PyTorch model to {}".format(os.path.abspath(pytorch_weights_dump_path))) + torch.save(model.state_dict(), pytorch_weights_dump_path) + print("Save configuration file to {}".format(os.path.abspath(pytorch_config_dump_path))) + with open(pytorch_config_dump_path, "w", encoding="utf-8") as f: + f.write(config.to_json_string()) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + ## Required parameters + parser.add_argument("--tf_checkpoint_path", + default = None, + type = str, + required = True, + help = "Path to the TensorFlow checkpoint path.") + parser.add_argument("--xlnet_config_file", + default = None, + type = str, + required = True, + help = "The config json file corresponding to the pre-trained XLNet model. 
\n" + "This specifies the model architecture.") + parser.add_argument("--pytorch_dump_folder_path", + default = None, + type = str, + required = True, + help = "Path to the folder to store the PyTorch model or dataset/vocab.") + parser.add_argument("--finetuning_task", + default = None, + type = str, + help = "Name of a task on which the XLNet TensorFloaw model was fine-tuned") + args = parser.parse_args() + print(args) + + convert_xlnet_checkpoint_to_pytorch(args.tf_checkpoint_path, + args.xlnet_config_file, + args.pytorch_dump_folder_path, + args.finetuning_task) diff --git a/Optimus/code/pytorch_transformers/file_utils.py b/Optimus/code/pytorch_transformers/file_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..3fe7fa891def43dac3423bddd3b8079cf9b11243 --- /dev/null +++ b/Optimus/code/pytorch_transformers/file_utils.py @@ -0,0 +1,294 @@ +""" +Utilities for working with the local dataset cache. +This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp +Copyright by the AllenNLP authors. +""" +from __future__ import (absolute_import, division, print_function, unicode_literals) + +import sys +import json +import logging +import os +import six +import shutil +import tempfile +import fnmatch +from functools import wraps +from hashlib import sha256 +from io import open + +import boto3 +from botocore.config import Config +from botocore.exceptions import ClientError +import requests +from tqdm import tqdm + +try: + from torch.hub import _get_torch_home + torch_cache_home = _get_torch_home() +except ImportError: + torch_cache_home = os.path.expanduser( + os.getenv('TORCH_HOME', os.path.join( + os.getenv('XDG_CACHE_HOME', '~/.cache'), 'torch'))) +default_cache_path = os.path.join(torch_cache_home, 'pytorch_transformers') + +try: + from urllib.parse import urlparse +except ImportError: + from urlparse import urlparse + +try: + from pathlib import Path + PYTORCH_PRETRAINED_BERT_CACHE = Path( + os.getenv('PYTORCH_TRANSFORMERS_CACHE', os.getenv('PYTORCH_PRETRAINED_BERT_CACHE', default_cache_path))) +except (AttributeError, ImportError): + PYTORCH_PRETRAINED_BERT_CACHE = os.getenv('PYTORCH_TRANSFORMERS_CACHE', + os.getenv('PYTORCH_PRETRAINED_BERT_CACHE', + default_cache_path)) + +PYTORCH_TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE # Kept for backward compatibility + +WEIGHTS_NAME = "pytorch_model.bin" +TF_WEIGHTS_NAME = 'model.ckpt' +CONFIG_NAME = "config.json" + +logger = logging.getLogger(__name__) # pylint: disable=invalid-name + +if not six.PY2: + def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = ''.join(docstr) + fn.__doc__ + return fn + return docstring_decorator + + def add_end_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = fn.__doc__ + ''.join(docstr) + return fn + return docstring_decorator +else: + # Not possible to update class docstrings on python2 + def add_start_docstrings(*docstr): + def docstring_decorator(fn): + return fn + return docstring_decorator + + def add_end_docstrings(*docstr): + def docstring_decorator(fn): + return fn + return docstring_decorator + +def url_to_filename(url, etag=None): + """ + Convert `url` into a hashed filename in a repeatable way. + If `etag` is specified, append its hash to the url's, delimited + by a period. + """ + url_bytes = url.encode('utf-8') + url_hash = sha256(url_bytes) + filename = url_hash.hexdigest() + + if etag: + etag_bytes = etag.encode('utf-8') + etag_hash = sha256(etag_bytes) + filename += '.' 
+ etag_hash.hexdigest() + + return filename + + +def filename_to_url(filename, cache_dir=None): + """ + Return the url and etag (which may be ``None``) stored for `filename`. + Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist. + """ + if cache_dir is None: + cache_dir = PYTORCH_TRANSFORMERS_CACHE + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + + cache_path = os.path.join(cache_dir, filename) + if not os.path.exists(cache_path): + raise EnvironmentError("file {} not found".format(cache_path)) + + meta_path = cache_path + '.json' + if not os.path.exists(meta_path): + raise EnvironmentError("file {} not found".format(meta_path)) + + with open(meta_path, encoding="utf-8") as meta_file: + metadata = json.load(meta_file) + url = metadata['url'] + etag = metadata['etag'] + + return url, etag + + +def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=None): + """ + Given something that might be a URL (or might be a local path), + determine which. If it's a URL, download the file and cache it, and + return the path to the cached file. If it's already a local path, + make sure the file exists and then return the path. + Args: + cache_dir: specify a cache directory to save the file to (overwrite the default cache dir). + force_download: if True, re-dowload the file even if it's already cached in the cache dir. + """ + if cache_dir is None: + cache_dir = PYTORCH_TRANSFORMERS_CACHE + if sys.version_info[0] == 3 and isinstance(url_or_filename, Path): + url_or_filename = str(url_or_filename) + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + + parsed = urlparse(url_or_filename) + + if parsed.scheme in ('http', 'https', 's3'): + # URL, so get it from the cache (downloading if necessary) + return get_from_cache(url_or_filename, cache_dir=cache_dir, force_download=force_download, proxies=proxies) + elif os.path.exists(url_or_filename): + # File, and it exists. + return url_or_filename + elif parsed.scheme == '': + # File, but it doesn't exist. + raise EnvironmentError("file {} not found".format(url_or_filename)) + else: + # Something unknown + raise ValueError("unable to parse {} as a URL or as a local path".format(url_or_filename)) + + +def split_s3_path(url): + """Split a full s3 path into the bucket name and path.""" + parsed = urlparse(url) + if not parsed.netloc or not parsed.path: + raise ValueError("bad s3 path {}".format(url)) + bucket_name = parsed.netloc + s3_path = parsed.path + # Remove '/' at beginning of path. + if s3_path.startswith("/"): + s3_path = s3_path[1:] + return bucket_name, s3_path + + +def s3_request(func): + """ + Wrapper function for s3 requests in order to create more helpful error + messages. 
+ """ + + @wraps(func) + def wrapper(url, *args, **kwargs): + try: + return func(url, *args, **kwargs) + except ClientError as exc: + if int(exc.response["Error"]["Code"]) == 404: + raise EnvironmentError("file {} not found".format(url)) + else: + raise + + return wrapper + + +@s3_request +def s3_etag(url, proxies=None): + """Check ETag on S3 object.""" + s3_resource = boto3.resource("s3", config=Config(proxies=proxies)) + bucket_name, s3_path = split_s3_path(url) + s3_object = s3_resource.Object(bucket_name, s3_path) + return s3_object.e_tag + + +@s3_request +def s3_get(url, temp_file, proxies=None): + """Pull a file directly from S3.""" + s3_resource = boto3.resource("s3", config=Config(proxies=proxies)) + bucket_name, s3_path = split_s3_path(url) + s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file) + + +def http_get(url, temp_file, proxies=None): + req = requests.get(url, stream=True, proxies=proxies) + content_length = req.headers.get('Content-Length') + total = int(content_length) if content_length is not None else None + progress = tqdm(unit="B", total=total) + for chunk in req.iter_content(chunk_size=1024): + if chunk: # filter out keep-alive new chunks + progress.update(len(chunk)) + temp_file.write(chunk) + progress.close() + + +def get_from_cache(url, cache_dir=None, force_download=False, proxies=None): + """ + Given a URL, look for the corresponding dataset in the local cache. + If it's not there, download it. Then return the path to the cached file. + """ + if cache_dir is None: + cache_dir = PYTORCH_TRANSFORMERS_CACHE + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + if sys.version_info[0] == 2 and not isinstance(cache_dir, str): + cache_dir = str(cache_dir) + + if not os.path.exists(cache_dir): + os.makedirs(cache_dir) + + # Get eTag to add to filename, if it exists. + if url.startswith("s3://"): + etag = s3_etag(url, proxies=proxies) + else: + try: + response = requests.head(url, allow_redirects=True, proxies=proxies) + if response.status_code != 200: + etag = None + else: + etag = response.headers.get("ETag") + except EnvironmentError: + etag = None + + if sys.version_info[0] == 2 and etag is not None: + etag = etag.decode('utf-8') + filename = url_to_filename(url, etag) + + # get cache path to put the file + cache_path = os.path.join(cache_dir, filename) + + # If we don't have a connection (etag is None) and can't identify the file + # try to get the last downloaded one + if not os.path.exists(cache_path) and etag is None: + matching_files = fnmatch.filter(os.listdir(cache_dir), filename + '.*') + matching_files = list(filter(lambda s: not s.endswith('.json'), matching_files)) + if matching_files: + cache_path = os.path.join(cache_dir, matching_files[-1]) + + if not os.path.exists(cache_path) or force_download: + # Download to temporary file, then copy to cache dir once finished. + # Otherwise you get corrupt cache entries if the download gets interrupted. 
+ with tempfile.NamedTemporaryFile() as temp_file: + logger.info("%s not found in cache or force_download set to True, downloading to %s", url, temp_file.name) + + # GET file object + if url.startswith("s3://"): + s3_get(url, temp_file, proxies=proxies) + else: + http_get(url, temp_file, proxies=proxies) + + # we are copying the file before closing it, so flush to avoid truncation + temp_file.flush() + # shutil.copyfileobj() starts at the current position, so go to the start + temp_file.seek(0) + + logger.info("copying %s to cache at %s", temp_file.name, cache_path) + with open(cache_path, 'wb') as cache_file: + shutil.copyfileobj(temp_file, cache_file) + + logger.info("creating metadata file for %s", cache_path) + meta = {'url': url, 'etag': etag} + meta_path = cache_path + '.json' + with open(meta_path, 'w') as meta_file: + output_string = json.dumps(meta) + if sys.version_info[0] == 2 and isinstance(output_string, str): + output_string = unicode(output_string, 'utf-8') # The beauty of python 2 + meta_file.write(output_string) + + logger.info("removing temp file %s", temp_file.name) + + return cache_path diff --git a/Optimus/code/pytorch_transformers/modeling_auto.py b/Optimus/code/pytorch_transformers/modeling_auto.py new file mode 100755 index 0000000000000000000000000000000000000000..31c8fafaa90fa03810a382ca85bb2d26600e4869 --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_auto.py @@ -0,0 +1,497 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Model class. """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging + +from .modeling_bert import BertModel, BertForMaskedLM, BertForSequenceClassification, BertForQuestionAnswering +from .modeling_openai import OpenAIGPTModel, OpenAIGPTLMHeadModel +from .modeling_gpt2 import GPT2Model, GPT2LMHeadModel +from .modeling_transfo_xl import TransfoXLModel, TransfoXLLMHeadModel +from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering +from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering +from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification +from .modeling_distilbert import DistilBertModel, DistilBertForQuestionAnswering, DistilBertForMaskedLM, DistilBertForSequenceClassification + +from .modeling_utils import PreTrainedModel, SequenceSummary + +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + + +class AutoModel(object): + r""" + :class:`~pytorch_transformers.AutoModel` is a generic model class + that will be instantiated as one of the base model classes of the library + when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` + class method. 
+ + The `from_pretrained()` method takes care of returning the correct model class instance + using pattern matching on the `pretrained_model_name_or_path` string. + + The base model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertModel (DistilBERT model) + - contains `roberta`: RobertaModel (RoBERTa model) + - contains `bert`: BertModel (Bert model) + - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model) + - contains `gpt2`: GPT2Model (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLModel (Transformer-XL model) + - contains `xlnet`: XLNetModel (XLNet model) + - contains `xlm`: XLMModel (XLM model) + + This class cannot be instantiated using `__init__()` (throws an error). + """ + def __init__(self): + raise EnvironmentError("AutoModel is designed to be instantiated " + "using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` method.") + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + r""" Instantiates one of the base model classes of the library + from a pre-trained model configuration. + + The model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertModel (DistilBERT model) + - contains `roberta`: RobertaModel (RoBERTa model) + - contains `bert`: BertModel (Bert model) + - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model) + - contains `gpt2`: GPT2Model (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLModel (Transformer-XL model) + - contains `xlnet`: XLNetModel (XLNet model) + - contains `xlm`: XLMModel (XLM model) + + The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated) + To train the model, you should first set it back in training mode with `model.train()` + + Params: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. + - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + model_args: (`optional`) Sequence of positional arguments: + All remaning positional arguments will be passed to the underlying model's ``__init__`` method + + config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`: + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: + + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. + - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. 
+ + state_dict: (`optional`) dict: + an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file. + This option can be used if you want to create a model from a pretrained configuration but load your own weights. + In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + output_loading_info: (`optional`) boolean: + Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages. + + kwargs: (`optional`) Remaining dictionary of keyword arguments: + Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded: + + - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done) + - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function. + + Examples:: + + model = AutoModel.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache. + model = AutoModel.from_pretrained('./test/bert_model/') # E.g. 
model was saved using `save_pretrained('./test/saved_model/')` + model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True) # Update configuration during loading + assert model.config.output_attention == True + # Loading from a TF checkpoint file instead of a PyTorch model (slower) + config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json') + model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config) + + """ + if 'distilbert' in pretrained_model_name_or_path: + return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'roberta' in pretrained_model_name_or_path: + return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'bert' in pretrained_model_name_or_path: + return BertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'openai-gpt' in pretrained_model_name_or_path: + return OpenAIGPTModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'gpt2' in pretrained_model_name_or_path: + return GPT2Model.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'transfo-xl' in pretrained_model_name_or_path: + return TransfoXLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlnet' in pretrained_model_name_or_path: + return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlm' in pretrained_model_name_or_path: + return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + + raise ValueError("Unrecognized model identifier in {}. Should contains one of " + "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', " + "'xlm', 'roberta'".format(pretrained_model_name_or_path)) + + +class AutoModelWithLMHead(object): + r""" + :class:`~pytorch_transformers.AutoModelWithLMHead` is a generic model class + that will be instantiated as one of the language modeling model classes of the library + when created with the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` + class method. + + The `from_pretrained()` method takes care of returning the correct model class instance + using pattern matching on the `pretrained_model_name_or_path` string. + + The model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertForMaskedLM (DistilBERT model) + - contains `roberta`: RobertaForMaskedLM (RoBERTa model) + - contains `bert`: BertForMaskedLM (Bert model) + - contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model) + - contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model) + - contains `xlnet`: XLNetLMHeadModel (XLNet model) + - contains `xlm`: XLMWithLMHeadModel (XLM model) + + This class cannot be instantiated using `__init__()` (throws an error). + """ + def __init__(self): + raise EnvironmentError("AutoModelWithLMHead is designed to be instantiated " + "using the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` method.") + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + r""" Instantiates one of the language modeling model classes of the library + from a pre-trained model configuration. 
+ + The `from_pretrained()` method takes care of returning the correct model class instance + using pattern matching on the `pretrained_model_name_or_path` string. + + The model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertForMaskedLM (DistilBERT model) + - contains `roberta`: RobertaForMaskedLM (RoBERTa model) + - contains `bert`: BertForMaskedLM (Bert model) + - contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model) + - contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model) + - contains `xlnet`: XLNetLMHeadModel (XLNet model) + - contains `xlm`: XLMWithLMHeadModel (XLM model) + + The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated) + To train the model, you should first set it back in training mode with `model.train()` + + Params: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. + - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + model_args: (`optional`) Sequence of positional arguments: + All remaning positional arguments will be passed to the underlying model's ``__init__`` method + + config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`: + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: + + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. + - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. + + state_dict: (`optional`) dict: + an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file. + This option can be used if you want to create a model from a pretrained configuration but load your own weights. + In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. 
+ The proxies are used on each request. + + output_loading_info: (`optional`) boolean: + Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages. + + kwargs: (`optional`) Remaining dictionary of keyword arguments: + Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded: + + - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done) + - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function. + + Examples:: + + model = AutoModelWithLMHead.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache. + model = AutoModelWithLMHead.from_pretrained('./test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` + model = AutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True) # Update configuration during loading + assert model.config.output_attention == True + # Loading from a TF checkpoint file instead of a PyTorch model (slower) + config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json') + model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config) + + """ + if 'distilbert' in pretrained_model_name_or_path: + return DistilBertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'roberta' in pretrained_model_name_or_path: + return RobertaForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'bert' in pretrained_model_name_or_path: + return BertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'openai-gpt' in pretrained_model_name_or_path: + return OpenAIGPTLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'gpt2' in pretrained_model_name_or_path: + return GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'transfo-xl' in pretrained_model_name_or_path: + return TransfoXLLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlnet' in pretrained_model_name_or_path: + return XLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlm' in pretrained_model_name_or_path: + return XLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + + raise ValueError("Unrecognized model identifier in {}. 
Should contain one of "
+                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
+                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
+
+
+class AutoModelForSequenceClassification(object):
+    r"""
+        :class:`~pytorch_transformers.AutoModelForSequenceClassification` is a generic model class
+        that will be instantiated as one of the sequence classification model classes of the library
+        when created with the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
+            - contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
+            - contains `bert`: BertForSequenceClassification (Bert model)
+            - contains `xlnet`: XLNetForSequenceClassification (XLNet model)
+            - contains `xlm`: XLMForSequenceClassification (XLM model)
+
+        This class cannot be instantiated using `__init__()` (throws an error).
+    """
+    def __init__(self):
+        raise EnvironmentError("AutoModelForSequenceClassification is designed to be instantiated "
+            "using the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` method.")
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+        r""" Instantiates one of the sequence classification model classes of the library
+        from a pre-trained model configuration.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
+            - contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
+            - contains `bert`: BertForSequenceClassification (Bert model)
+            - contains `xlnet`: XLNetForSequenceClassification (XLNet model)
+            - contains `xlm`: XLMForSequenceClassification (XLM model)
+
+        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
+        To train the model, you should first set it back in training mode with `model.train()`
+
+        Params:
+            pretrained_model_name_or_path: either:
+
+                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
+                - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
+                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
+ + model_args: (`optional`) Sequence of positional arguments: + All remaning positional arguments will be passed to the underlying model's ``__init__`` method + + config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`: + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: + + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. + - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. + + state_dict: (`optional`) dict: + an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file. + This option can be used if you want to create a model from a pretrained configuration but load your own weights. + In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + output_loading_info: (`optional`) boolean: + Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages. + + kwargs: (`optional`) Remaining dictionary of keyword arguments: + Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded: + + - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done) + - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function. + + Examples:: + + model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache. + model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/') # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`
+            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
+            assert model.config.output_attention == True
+            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
+            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
+            model = AutoModelForSequenceClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
+
+        """
+        if 'distilbert' in pretrained_model_name_or_path:
+            return DistilBertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
+            return RobertaForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'bert' in pretrained_model_name_or_path:
+            return BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'xlnet' in pretrained_model_name_or_path:
+            return XLNetForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'xlm' in pretrained_model_name_or_path:
+            return XLMForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+
+        raise ValueError("Unrecognized model identifier in {}. Should contain one of "
+                         "'bert', 'xlnet', 'xlm', 'roberta'".format(pretrained_model_name_or_path))
+
+
+class AutoModelForQuestionAnswering(object):
+    r"""
+        :class:`~pytorch_transformers.AutoModelForQuestionAnswering` is a generic model class
+        that will be instantiated as one of the question answering model classes of the library
+        when created with the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `distilbert`: DistilBertForQuestionAnswering (DistilBERT model)
+            - contains `bert`: BertForQuestionAnswering (Bert model)
+            - contains `xlnet`: XLNetForQuestionAnswering (XLNet model)
+            - contains `xlm`: XLMForQuestionAnswering (XLM model)
+
+        This class cannot be instantiated using `__init__()` (throws an error).
+    """
+    def __init__(self):
+        raise EnvironmentError("AutoModelForQuestionAnswering is designed to be instantiated "
+            "using the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` method.")
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+        r""" Instantiates one of the question answering model classes of the library
+        from a pre-trained model configuration.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+ + The model class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertForQuestionAnswering (DistilBERT model) + - contains `bert`: BertForQuestionAnswering (Bert model) + - contains `xlnet`: XLNetForQuestionAnswering (XLNet model) + - contains `xlm`: XLMForQuestionAnswering (XLM model) + + The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated) + To train the model, you should first set it back in training mode with `model.train()` + + Params: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. + - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + model_args: (`optional`) Sequence of positional arguments: + All remaning positional arguments will be passed to the underlying model's ``__init__`` method + + config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`: + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: + + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. + - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. + + state_dict: (`optional`) dict: + an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file. + This option can be used if you want to create a model from a pretrained configuration but load your own weights. + In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + output_loading_info: (`optional`) boolean: + Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages. + + kwargs: (`optional`) Remaining dictionary of keyword arguments: + Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded: + + - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done) + - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function. + + Examples:: + + model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache. + model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` + model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True) # Update configuration during loading + assert model.config.output_attention == True + # Loading from a TF checkpoint file instead of a PyTorch model (slower) + config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json') + model = AutoModelForQuestionAnswering.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config) + + """ + if 'distilbert' in pretrained_model_name_or_path: + return DistilBertForQuestionAnswering.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'bert' in pretrained_model_name_or_path: + return BertForQuestionAnswering.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlnet' in pretrained_model_name_or_path: + return XLNetForQuestionAnswering.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + elif 'xlm' in pretrained_model_name_or_path: + return XLMForQuestionAnswering.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + + raise ValueError("Unrecognized model identifier in {}. Should contains one of " + "'bert', 'xlnet', 'xlm'".format(pretrained_model_name_or_path)) diff --git a/Optimus/code/pytorch_transformers/modeling_bert.py b/Optimus/code/pytorch_transformers/modeling_bert.py new file mode 100755 index 0000000000000000000000000000000000000000..c85452027edd1f6effd09dca545f980b5396687a --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_bert.py @@ -0,0 +1,1341 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch BERT model. 
""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import math +import os +import sys +from io import open + +import pdb + +import torch +from torch import nn +from torch.nn import CrossEntropyLoss, MSELoss + +from .modeling_utils import PreTrainedModel, prune_linear_layer +from .configuration_bert import BertConfig +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +BERT_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-pytorch_model.bin", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-pytorch_model.bin", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin", + 'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-pytorch_model.bin", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-pytorch_model.bin", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin", +} + +def load_tf_weights_in_bert(model, config, tf_checkpoint_path): + """ Load tf checkpoints in a pytorch model. + """ + try: + import re + import numpy as np + import tensorflow as tf + except ImportError: + logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + tf_path = os.path.abspath(tf_checkpoint_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split('/') + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if any(n in ["adam_v", "adam_m", "global_step"] for n in name): + logger.info("Skipping {}".format("/".join(name))) + continue + pointer = model + for m_name in name: + if re.fullmatch(r'[A-Za-z]+_\d+', m_name): + l = re.split(r'_(\d+)', m_name) + else: + l = [m_name] + if l[0] == 'kernel' or l[0] == 'gamma': + pointer = getattr(pointer, 'weight') + elif l[0] == 'output_bias' or l[0] == 'beta': + pointer = getattr(pointer, 'bias') + elif l[0] == 'output_weights': + pointer = getattr(pointer, 'weight') + elif l[0] == 'squad': + pointer = getattr(pointer, 'classifier') + else: + try: + pointer = getattr(pointer, l[0]) + except AttributeError: + logger.info("Skipping {}".format("/".join(name))) + continue + if len(l) >= 2: + num = int(l[1]) + pointer = pointer[num] + if m_name[-11:] == '_embeddings': + pointer = getattr(pointer, 'weight') + elif m_name == 'kernel': + array = np.transpose(array) + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +def gelu(x): + """Implementation of the gelu activation function. + For information: OpenAI GPT's gelu is slightly different (and gives slightly different results): + 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + Also see https://arxiv.org/abs/1606.08415 + """ + return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) + + +def swish(x): + return x * torch.sigmoid(x) + + +ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish} + + +try: + from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm +except (ImportError, AttributeError) as e: + logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .") + BertLayerNorm = torch.nn.LayerNorm + +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings. 
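+    The word, position and token-type embeddings are summed, then LayerNorm and dropout are applied (see ``forward()``).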
+ """ + def __init__(self, config): + super(BertEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + seq_length = input_ids.size(1) + if position_ids is None: + position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + words_embeddings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = words_embeddings + position_embeddings + token_type_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Module): + def __init__(self, config): + super(BertSelfAttention, self).__init__() + if config.hidden_size % config.num_attention_heads != 0: + raise ValueError( + "The hidden size (%d) is not a multiple of the number of attention " + "heads (%d)" % (config.hidden_size, config.num_attention_heads)) + self.output_attentions = config.output_attentions + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward(self, hidden_states, attention_mask, head_mask=None): + mixed_query_layer = self.query(hidden_states) + mixed_key_layer = self.key(hidden_states) + mixed_value_layer = self.value(hidden_states) + + query_layer = self.transpose_for_scores(mixed_query_layer) + key_layer = self.transpose_for_scores(mixed_key_layer) + value_layer = self.transpose_for_scores(mixed_value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + # Apply the attention mask is (precomputed for all layers in BertModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. 
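+        # Shape reference (batch size B, num heads H, sequence length S, head size D):
+        #   query/key/value layers are (B, H, S, D);
+        #   attention_scores = Q.K^T / sqrt(D) and attention_probs are (B, H, S, S),
+        #   with each row of attention_probs summing to 1 over the key positions;
+        #   context_layer = attention_probs.V is (B, H, S, D), reshaped to (B, S, H*D) below.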
+ attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,) + return outputs + + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super(BertSelfOutput, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Module): + def __init__(self, config): + super(BertAttention, self).__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size) + heads = set(heads) - self.pruned_heads # Convert to set and emove already pruned heads + for head in heads: + # Compute how many pruned heads are before the head and move the index accordingly + head = head - sum(1 if h < head else 0 for h in self.pruned_heads) + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward(self, input_tensor, attention_mask, head_mask=None): + self_outputs = self.self(input_tensor, attention_mask, head_mask) + attention_output = self.output(self_outputs[0], input_tensor) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class BertIntermediate(nn.Module): + def __init__(self, config): + super(BertIntermediate, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Module): + def __init__(self, config): + super(BertOutput, self).__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def 
forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Module): + def __init__(self, config): + super(BertLayer, self).__init__() + self.attention = BertAttention(config) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward(self, hidden_states, attention_mask, head_mask=None): + attention_outputs = self.attention(hidden_states, attention_mask, head_mask) + attention_output = attention_outputs[0] + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + outputs = (layer_output,) + attention_outputs[1:] # add attentions if we output them + return outputs + + +class BertEncoder(nn.Module): + def __init__(self, config): + super(BertEncoder, self).__init__() + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward(self, hidden_states, attention_mask, head_mask=None): + all_hidden_states = () + all_attentions = () + for i, layer_module in enumerate(self.layer): + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i]) + hidden_states = layer_outputs[0] + + if self.output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + # Add last layer + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + outputs = (hidden_states,) + if self.output_hidden_states: + outputs = outputs + (all_hidden_states,) + if self.output_attentions: + outputs = outputs + (all_attentions,) + return outputs # last-layer hidden state, (all hidden states), (all attentions) + + +class BertPooler(nn.Module): + def __init__(self, config): + super(BertPooler, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPredictionHeadTransform(nn.Module): + def __init__(self, config): + super(BertPredictionHeadTransform, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)): + self.transform_act_fn = ACT2FN[config.hidden_act] + else: + self.transform_act_fn = config.hidden_act + self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.transform_act_fn(hidden_states) + hidden_states = self.LayerNorm(hidden_states) + return hidden_states + + +class BertLMPredictionHead(nn.Module): + def __init__(self, config): + super(BertLMPredictionHead, self).__init__() + self.transform = BertPredictionHeadTransform(config) + + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
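+        # Note: the decoder weight is not tied here; tying it to the input word embeddings
+        # is expected to be handled by the framework's weight-tying step (see PreTrainedModel),
+        # while the standalone bias defined below is added manually in forward().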
+ self.decoder = nn.Linear(config.hidden_size, + config.vocab_size, + bias=False) + + self.bias = nn.Parameter(torch.zeros(config.vocab_size)) + + def forward(self, hidden_states): + hidden_states = self.transform(hidden_states) + hidden_states = self.decoder(hidden_states) + self.bias + return hidden_states + + +class BertOnlyMLMHead(nn.Module): + def __init__(self, config): + super(BertOnlyMLMHead, self).__init__() + self.predictions = BertLMPredictionHead(config) + + def forward(self, sequence_output): + prediction_scores = self.predictions(sequence_output) + return prediction_scores + + +class BertOnlyNSPHead(nn.Module): + def __init__(self, config): + super(BertOnlyNSPHead, self).__init__() + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, pooled_output): + seq_relationship_score = self.seq_relationship(pooled_output) + return seq_relationship_score + + +class BertPreTrainingHeads(nn.Module): + def __init__(self, config): + super(BertPreTrainingHeads, self).__init__() + self.predictions = BertLMPredictionHead(config) + self.seq_relationship = nn.Linear(config.hidden_size, 2) + + def forward(self, sequence_output, pooled_output): + prediction_scores = self.predictions(sequence_output) + seq_relationship_score = self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class BertPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = BertConfig + pretrained_model_archive_map = BERT_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = load_tf_weights_in_bert + base_model_prefix = "bert" + + def _init_weights(self, module): + """ Initialize the weights """ + if isinstance(module, (nn.Linear, nn.Embedding)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + elif isinstance(module, BertLayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + +BERT_START_DOCSTRING = r""" The BERT model was proposed in + `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ + by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a bidirectional transformer + pre-trained using a combination of masked language modeling objective and next sentence prediction + on a large corpus comprising the Toronto Book Corpus and Wikipedia. + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`: + https://arxiv.org/abs/1810.04805 + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.BertConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. 
+""" + +BERT_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + To match pre-training, BERT input sequence should be formatted with [CLS] and [SEP] tokens as follows: + + (a) For sequence pairs: + + ``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]`` + + ``token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1`` + + (b) For single sequences: + + ``tokens: [CLS] the dog is hairy . [SEP]`` + + ``token_type_ids: 0 0 0 0 0 0 0`` + + Bert is a model with absolute position embeddings so it's usually advised to pad the inputs on + the right rather than the left. + + Indices can be obtained using :class:`pytorch_transformers.BertTokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Segment token indices to indicate first and second portions of the inputs. + Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` + corresponds to a `sentence B` token + (see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details). + **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, config.max_position_embeddings - 1]``. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare Bert Model transformer outputting raw hidden-states without any specific head on top.", + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertModel(BertPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the output of the last layer of the model. + **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)`` + Last layer hidden-state of the first token of the sequence (classification token) + further processed by a Linear layer and a Tanh activation function. The Linear + layer weights are trained from the next sentence prediction (classification) + objective during Bert pretraining. This output is usually *not* a good summary + of the semantic content of the input, you're often better with averaging or pooling + the sequence of hidden-states for the whole input sequence. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
+ **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertModel.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(BertModel, self).__init__(config) + + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + self.pooler = BertPooler(config) + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + old_embeddings = self.embeddings.word_embeddings + new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens) + self.embeddings.word_embeddings = new_embeddings + return self.embeddings.word_embeddings + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + See base class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None): + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. 
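+        # For example, attention_mask = [[1, 1, 0]] becomes
+        # extended_attention_mask = [[[[0.0, 0.0, -10000.0]]]] after the two lines below,
+        # which then broadcasts over every head and query position.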
+ extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.num_hidden_layers + + embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, + extended_attention_mask, + head_mask=head_mask) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) + + outputs = (sequence_output, pooled_output,) + encoder_outputs[1:] # add hidden_states and attentions if they are here + return outputs # sequence_output, pooled_output, (hidden_states), (attentions) + + + + + + +@add_start_docstrings("The bare Bert Model transformer outputting raw hidden-states without any specific head on top.", + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForLatentConnector(BertPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the output of the last layer of the model. + **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)`` + Last layer hidden-state of the first token of the sequence (classification token) + further processed by a Linear layer and a Tanh activation function. The Linear + layer weights are trained from the next sentence prediction (classification) + objective during Bert pretraining. This output is usually *not* a good summary + of the semantic content of the input, you're often better with averaging or pooling + the sequence of hidden-states for the whole input sequence. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
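+
+        Note: unlike :class:`BertModel`, this class also defines a ``linear`` layer mapping
+        ``hidden_size`` to ``2 * latent_size``. It is not applied inside ``forward()``; it is
+        presumably meant to map the pooled output to the latent statistics (mean and
+        log-variance) used by the Optimus VAE.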
+ + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertModel.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config, latent_size): + super(BertForLatentConnector, self).__init__(config) + + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + self.pooler = BertPooler(config) + + self.linear = nn.Linear(config.hidden_size, 2 * latent_size, bias=False) + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + old_embeddings = self.embeddings.word_embeddings + new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens) + self.embeddings.word_embeddings = new_embeddings + return self.embeddings.word_embeddings + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + See base class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None): + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) + if token_type_ids is None: + token_type_ids = torch.zeros_like(input_ids) + + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. 
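+        # Note: from here on this forward mirrors BertModel.forward (same additive mask trick
+        # as above). The latent head `self.linear` (hidden_size -> 2 * latent_size) defined in
+        # __init__ is not applied inside forward; callers are expected to project
+        # `pooled_output` into the latent mean/log-variance (see the sketch above).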
+ extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.num_hidden_layers + + embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, + extended_attention_mask, + head_mask=head_mask) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) + + outputs = (sequence_output, pooled_output,) + encoder_outputs[1:] # add hidden_states and attentions if they are here + return outputs # sequence_output, pooled_output, (hidden_states), (attentions) + + + +@add_start_docstrings("""Bert Model with two heads on top as done during the pre-training: + a `masked language modeling` head and a `next sentence prediction (classification)` head. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForPreTraining(BertPreTrainedModel): + r""" + **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for computing the masked language modeling loss. + Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) + Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels + in ``[0, ..., config.vocab_size]`` + **next_sentence_label**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring) + Indices should be in ``[0, 1]``. + ``0`` indicates sequence B is a continuation of sequence A, + ``1`` indicates sequence B is a random sequence. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when both ``masked_lm_labels`` and ``next_sentence_label`` are provided) ``torch.FloatTensor`` of shape ``(1,)``: + Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **seq_relationship_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, 2)`` + Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForPreTraining.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + prediction_scores, seq_relationship_scores = outputs[:2] + + """ + def __init__(self, config): + super(BertForPreTraining, self).__init__(config) + + self.bert = BertModel(config) + self.cls = BertPreTrainingHeads(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. + """ + self._tie_or_clone_weights(self.cls.predictions.decoder, + self.bert.embeddings.word_embeddings) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + masked_lm_labels=None, next_sentence_label=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output) + + outputs = (prediction_scores, seq_relationship_score,) + outputs[2:] # add hidden states and attention if they are here + + if masked_lm_labels is not None and next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + total_loss = masked_lm_loss + next_sentence_loss + outputs = (total_loss,) + outputs + + return outputs # (loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model with a `language modeling` head on top. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForMaskedLM(BertPreTrainedModel): + r""" + **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for computing the masked language modeling loss. + Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) + Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels + in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Masked language modeling loss. 
+ **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForMaskedLM.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, masked_lm_labels=input_ids) + loss, prediction_scores = outputs[:2] + + """ + def __init__(self, config): + super(BertForMaskedLM, self).__init__(config) + + self.bert = BertModel(config) + self.cls = BertOnlyMLMHead(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. + """ + self._tie_or_clone_weights(self.cls.predictions.decoder, + self.bert.embeddings.word_embeddings) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + masked_lm_labels=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output) + + outputs = (prediction_scores,) + outputs[2:] # Add hidden states and attention if they are here + if masked_lm_labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + outputs = (masked_lm_loss,) + outputs + + return outputs # (masked_lm_loss), prediction_scores, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model with a `next sentence prediction (classification)` head on top. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForNextSentencePrediction(BertPreTrainedModel): + r""" + **next_sentence_label**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring) + Indices should be in ``[0, 1]``. + ``0`` indicates sequence B is a continuation of sequence A, + ``1`` indicates sequence B is a random sequence. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``next_sentence_label`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Next sequence prediction (classification) loss. 
+ **seq_relationship_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, 2)`` + Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + seq_relationship_scores = outputs[0] + + """ + def __init__(self, config): + super(BertForNextSentencePrediction, self).__init__(config) + + self.bert = BertModel(config) + self.cls = BertOnlyNSPHead(config) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + next_sentence_label=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + pooled_output = outputs[1] + + seq_relationship_score = self.cls(pooled_output) + + outputs = (seq_relationship_score,) + outputs[2:] # add hidden states and attention if they are here + if next_sentence_label is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1)) + outputs = (next_sentence_loss,) + outputs + + return outputs # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of + the pooled output) e.g. for GLUE tasks. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForSequenceClassification(BertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. + If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForSequenceClassification.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(BertForSequenceClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, self.config.num_labels) + self.use_freeze = False + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, + position_ids=None, head_mask=None, labels=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + pooled_output = outputs[1] + + if self.use_freeze: + pooled_output = pooled_output.detach() + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here + + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + # pdb.set_trace() + return outputs, pooled_output # (loss), logits, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of + the pooled output) e.g. for GLUE tasks. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForSequenceClassificationLatentConnector(BertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. + If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForSequenceClassificationLatentConnector.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + def __init__(self, config, latent_size): + super(BertForSequenceClassificationLatentConnector, self).__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.classifier = nn.Linear(config.hidden_size, self.config.num_labels) + self.linear = nn.Linear(config.hidden_size, 2 * latent_size, bias=False) + self.use_freeze = False + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, + position_ids=None, head_mask=None, labels=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + + pooled_output = outputs[1] + # mean, logvar = self.linear(pooled_output).chunk(2, -1) + + if self.use_freeze: + pooled_output = pooled_output.detach() + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here + + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs, pooled_output # (loss), logits, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model with a multiple choice classification head on top (a linear layer on top of + the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForMultipleChoice(BertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the multiple choice classification loss. + Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above) + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss. + **classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above). 
+ Classification scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForMultipleChoice.from_pretrained('bert-base-uncased') + choices = ["Hello, my dog is cute", "Hello, my cat is amazing"] + input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices + labels = torch.tensor(1).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, classification_scores = outputs[:2] + + """ + def __init__(self, config): + super(BertForMultipleChoice, self).__init__(config) + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, + position_ids=None, head_mask=None, labels=None): + num_choices = input_ids.shape[1] + + input_ids = input_ids.view(-1, input_ids.size(-1)) + attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None + token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None + position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.view(-1, num_choices) + + outputs = (reshaped_logits,) + outputs[2:] # add hidden states and attention if they are here + + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + outputs = (loss,) + outputs + + return outputs # (loss), reshaped_logits, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model with a token classification head on top (a linear layer on top of + the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForTokenClassification(BertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for computing the token classification loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss. + **scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.num_labels)`` + Classification scores (before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForTokenClassification.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, scores = outputs[:2] + + """ + def __init__(self, config): + super(BertForTokenClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, + position_ids=None, head_mask=None, labels=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here + if labels is not None: + loss_fct = CrossEntropyLoss() + # Only keep active parts of the loss + if attention_mask is not None: + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels)[active_loss] + active_labels = labels.view(-1)[active_loss] + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (loss), scores, (hidden_states), (attentions) + + +@add_start_docstrings("""Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of + the hidden-states output to compute `span start logits` and `span end logits`). """, + BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING) +class BertForQuestionAnswering(BertPreTrainedModel): + r""" + **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. 
+ + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. + **start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-start scores (before SoftMax). + **end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-end scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertForQuestionAnswering.from_pretrained('bert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + start_positions = torch.tensor([1]) + end_positions = torch.tensor([3]) + outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions) + loss, start_scores, end_scores = outputs[:2] + + """ + def __init__(self, config): + super(BertForQuestionAnswering, self).__init__(config) + self.num_labels = config.num_labels + + self.bert = BertModel(config) + self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + start_positions=None, end_positions=None): + + outputs = self.bert(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + sequence_output = outputs[0] + + logits = self.qa_outputs(sequence_output) + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1) + end_logits = end_logits.squeeze(-1) + + outputs = (start_logits, end_logits,) + outputs[2:] + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + # sometimes the start/end positions are outside our model inputs, we ignore these terms + ignored_index = start_logits.size(1) + start_positions.clamp_(0, ignored_index) + end_positions.clamp_(0, ignored_index) + + loss_fct = CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + outputs = (total_loss,) + outputs + + return outputs # (loss), start_logits, end_logits, (hidden_states), (attentions) diff --git a/Optimus/code/pytorch_transformers/modeling_distilbert.py b/Optimus/code/pytorch_transformers/modeling_distilbert.py new file mode 100755 index 
0000000000000000000000000000000000000000..c5cc44be7506fad9a0539c9868e8233c8da704d2 --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_distilbert.py @@ -0,0 +1,695 @@ +# coding=utf-8 +# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch DistilBERT model + adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM) + and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert) +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import math +import copy +import sys +from io import open + +import itertools +import numpy as np + +import torch +import torch.nn as nn + +from .modeling_utils import PreTrainedModel, prune_linear_layer +from .configuration_distilbert import DistilBertConfig +from .file_utils import add_start_docstrings + +import logging +logger = logging.getLogger(__name__) + + +DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin", + 'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin" +} + + +### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ### +def gelu(x): + return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0))) + +def create_sinusoidal_embeddings(n_pos, dim, out): + position_enc = np.array([ + [pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] + for pos in range(n_pos) + ]) + out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2])) + out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2])) + out.detach_() + out.requires_grad = False + +class Embeddings(nn.Module): + def __init__(self, + config): + super(Embeddings, self).__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=0) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim) + if config.sinusoidal_pos_embds: + create_sinusoidal_embeddings(n_pos=config.max_position_embeddings, + dim=config.dim, + out=self.position_embeddings.weight) + + self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12) + self.dropout = nn.Dropout(config.dropout) + + def forward(self, input_ids): + """ + Parameters + ---------- + input_ids: torch.tensor(bs, max_seq_length) + The token ids to embed. 
+ + Outputs + ------- + embeddings: torch.tensor(bs, max_seq_length, dim) + The embedded tokens (plus position embeddings, no token_type embeddings) + """ + seq_length = input_ids.size(1) + position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) # (max_seq_length) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) # (bs, max_seq_length) + + word_embeddings = self.word_embeddings(input_ids) # (bs, max_seq_length, dim) + position_embeddings = self.position_embeddings(position_ids) # (bs, max_seq_length, dim) + + embeddings = word_embeddings + position_embeddings # (bs, max_seq_length, dim) + embeddings = self.LayerNorm(embeddings) # (bs, max_seq_length, dim) + embeddings = self.dropout(embeddings) # (bs, max_seq_length, dim) + return embeddings + +class MultiHeadSelfAttention(nn.Module): + def __init__(self, config): + super(MultiHeadSelfAttention, self).__init__() + + self.n_heads = config.n_heads + self.dim = config.dim + self.dropout = nn.Dropout(p=config.attention_dropout) + self.output_attentions = config.output_attentions + + assert self.dim % self.n_heads == 0 + + self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim) + self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim) + self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim) + self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim) + + self.pruned_heads = set() + + def prune_heads(self, heads): + attention_head_size = self.dim // self.n_heads + if len(heads) == 0: + return + mask = torch.ones(self.n_heads, attention_head_size) + heads = set(heads) - self.pruned_heads + for head in heads: + head -= sum(1 if h < head else 0 for h in self.pruned_heads) + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + # Prune linear layers + self.q_lin = prune_linear_layer(self.q_lin, index) + self.k_lin = prune_linear_layer(self.k_lin, index) + self.v_lin = prune_linear_layer(self.v_lin, index) + self.out_lin = prune_linear_layer(self.out_lin, index, dim=1) + # Update hyper params + self.n_heads = self.n_heads - len(heads) + self.dim = attention_head_size * self.n_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward(self, query, key, value, mask, head_mask = None): + """ + Parameters + ---------- + query: torch.tensor(bs, seq_length, dim) + key: torch.tensor(bs, seq_length, dim) + value: torch.tensor(bs, seq_length, dim) + mask: torch.tensor(bs, seq_length) + + Outputs + ------- + weights: torch.tensor(bs, n_heads, seq_length, seq_length) + Attention weights + context: torch.tensor(bs, seq_length, dim) + Contextualized layer. 
Optional: only if `output_attentions=True` + """ + bs, q_length, dim = query.size() + k_length = key.size(1) + # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim) + # assert key.size() == value.size() + + dim_per_head = self.dim // self.n_heads + + assert 2 <= mask.dim() <= 3 + causal = (mask.dim() == 3) + mask_reshp = (bs, 1, 1, k_length) + + def shape(x): + """ separate heads """ + return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2) + + def unshape(x): + """ group heads """ + return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head) + + q = shape(self.q_lin(query)) # (bs, n_heads, q_length, dim_per_head) + k = shape(self.k_lin(key)) # (bs, n_heads, k_length, dim_per_head) + v = shape(self.v_lin(value)) # (bs, n_heads, k_length, dim_per_head) + + q = q / math.sqrt(dim_per_head) # (bs, n_heads, q_length, dim_per_head) + scores = torch.matmul(q, k.transpose(2,3)) # (bs, n_heads, q_length, k_length) + mask = (mask==0).view(mask_reshp).expand_as(scores) # (bs, n_heads, q_length, k_length) + scores.masked_fill_(mask, -float('inf')) # (bs, n_heads, q_length, k_length) + + weights = nn.Softmax(dim=-1)(scores) # (bs, n_heads, q_length, k_length) + weights = self.dropout(weights) # (bs, n_heads, q_length, k_length) + + # Mask heads if we want to + if head_mask is not None: + weights = weights * head_mask + + context = torch.matmul(weights, v) # (bs, n_heads, q_length, dim_per_head) + context = unshape(context) # (bs, q_length, dim) + context = self.out_lin(context) # (bs, q_length, dim) + + if self.output_attentions: + return (context, weights) + else: + return (context,) + +class FFN(nn.Module): + def __init__(self, config): + super(FFN, self).__init__() + self.dropout = nn.Dropout(p=config.dropout) + self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim) + self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim) + assert config.activation in ['relu', 'gelu'], "activation ({}) must be in ['relu', 'gelu']".format(config.activation) + self.activation = gelu if config.activation == 'gelu' else nn.ReLU() + + def forward(self, input): + x = self.lin1(input) + x = self.activation(x) + x = self.lin2(x) + x = self.dropout(x) + return x + +class TransformerBlock(nn.Module): + def __init__(self, config): + super(TransformerBlock, self).__init__() + + self.n_heads = config.n_heads + self.dim = config.dim + self.hidden_dim = config.hidden_dim + self.dropout = nn.Dropout(p=config.dropout) + self.activation = config.activation + self.output_attentions = config.output_attentions + + assert config.dim % config.n_heads == 0 + + self.attention = MultiHeadSelfAttention(config) + self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12) + + self.ffn = FFN(config) + self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12) + + def forward(self, x, attn_mask=None, head_mask=None): + """ + Parameters + ---------- + x: torch.tensor(bs, seq_length, dim) + attn_mask: torch.tensor(bs, seq_length) + + Outputs + ------- + sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length) + The attention weights + ffn_output: torch.tensor(bs, seq_length, dim) + The output of the transformer block contextualization. 
+ """ + # Self-Attention + sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask) + if self.output_attentions: + sa_output, sa_weights = sa_output # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length) + else: # To handle these `output_attention` or `output_hidden_states` cases returning tuples + assert type(sa_output) == tuple + sa_output = sa_output[0] + sa_output = self.sa_layer_norm(sa_output + x) # (bs, seq_length, dim) + + # Feed Forward Network + ffn_output = self.ffn(sa_output) # (bs, seq_length, dim) + ffn_output = self.output_layer_norm(ffn_output + sa_output) # (bs, seq_length, dim) + + output = (ffn_output,) + if self.output_attentions: + output = (sa_weights,) + output + return output + + +class Transformer(nn.Module): + def __init__(self, config): + super(Transformer, self).__init__() + self.n_layers = config.n_layers + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + layer = TransformerBlock(config) + self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)]) + + def forward(self, x, attn_mask=None, head_mask=None): + """ + Parameters + ---------- + x: torch.tensor(bs, seq_length, dim) + Input sequence embedded. + attn_mask: torch.tensor(bs, seq_length) + Attention mask on the sequence. + + Outputs + ------- + hidden_state: torch.tensor(bs, seq_length, dim) + Sequence of hiddens states in the last (top) layer + all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)] + Tuple of length n_layers with the hidden states from each layer. + Optional: only if output_hidden_states=True + all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)] + Tuple of length n_layers with the attention weights from each layer + Optional: only if output_attentions=True + """ + all_hidden_states = () + all_attentions = () + + hidden_state = x + for i, layer_module in enumerate(self.layer): + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_state,) + + layer_outputs = layer_module(x=hidden_state, + attn_mask=attn_mask, + head_mask=head_mask[i]) + hidden_state = layer_outputs[-1] + + if self.output_attentions: + assert len(layer_outputs) == 2 + attentions = layer_outputs[0] + all_attentions = all_attentions + (attentions,) + else: + assert len(layer_outputs) == 1 + + # Add last layer + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_state,) + + outputs = (hidden_state,) + if self.output_hidden_states: + outputs = outputs + (all_hidden_states,) + if self.output_attentions: + outputs = outputs + (all_attentions,) + return outputs # last-layer hidden state, (all hidden states), (all attentions) + + +### INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL ### +class DistilBertPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for downloading and loading pretrained models. + """ + config_class = DistilBertConfig + pretrained_model_archive_map = DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = None + base_model_prefix = "distilbert" + + def __init__(self, *inputs, **kwargs): + super(DistilBertPreTrainedModel, self).__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """ Initialize the weights. 
+ """ + if isinstance(module, nn.Embedding): + if module.weight.requires_grad: + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + +DISTILBERT_START_DOCSTRING = r""" + DistilBERT is a small, fast, cheap and light Transformer model + trained by distilling Bert base. It has 40% less parameters than + `bert-base-uncased`, runs 60% faster while preserving over 95% of + Bert's performances as measured on the GLUE language understanding benchmark. + + Here are the differences between the interface of Bert and DistilBert: + + - DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`) + - DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option. + + For more information on DistilBERT, please refer to our + `detailed blog post`_ + + .. _`detailed blog post`: + https://medium.com/huggingface/distilbert-8cf3380435b5 + + Parameters: + config (:class:`~pytorch_transformers.DistilBertConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +DISTILBERT_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids** ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + The input sequences should start with `[CLS]` and end with `[SEP]` tokens. + + For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DistilBERT. + **attention_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.", + DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING) +class DistilBertModel(DistilBertPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the output of the last layer of the model. 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased') + model = DistilBertModel.from_pretrained('distilbert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(DistilBertModel, self).__init__(config) + + self.embeddings = Embeddings(config) # Embeddings + self.transformer = Transformer(config) # Encoder + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + old_embeddings = self.embeddings.word_embeddings + new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens) + self.embeddings.word_embeddings = new_embeddings + return self.embeddings.word_embeddings + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + See base class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.transformer.layer[layer].attention.prune_heads(heads) + + def forward(self, + input_ids, attention_mask=None, head_mask=None): + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) # (bs, seq_length) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.num_hidden_layers + + embedding_output = self.embeddings(input_ids) # (bs, seq_length, dim) + tfmr_output = self.transformer(x=embedding_output, + attn_mask=attention_mask, + head_mask=head_mask) + hidden_state = tfmr_output[0] + output = (hidden_state, ) + tfmr_output[1:] + + return output # last-layer hidden-state, (all hidden_states), (all attentions) + + +@add_start_docstrings("""DistilBert Model with a `masked language modeling` head on top. 
""", + DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING) +class DistilBertForMaskedLM(DistilBertPreTrainedModel): + r""" + **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for computing the masked language modeling loss. + Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) + Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels + in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Masked language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased') + model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, masked_lm_labels=input_ids) + loss, prediction_scores = outputs[:2] + + """ + def __init__(self, config): + super(DistilBertForMaskedLM, self).__init__(config) + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + self.distilbert = DistilBertModel(config) + self.vocab_transform = nn.Linear(config.dim, config.dim) + self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12) + self.vocab_projector = nn.Linear(config.dim, config.vocab_size) + + self.init_weights() + self.tie_weights() + + self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1) + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. 
+ """ + self._tie_or_clone_weights(self.vocab_projector, + self.distilbert.embeddings.word_embeddings) + + def forward(self, input_ids, attention_mask=None, head_mask=None, masked_lm_labels=None): + dlbrt_output = self.distilbert(input_ids=input_ids, + attention_mask=attention_mask, + head_mask=head_mask) + hidden_states = dlbrt_output[0] # (bs, seq_length, dim) + prediction_logits = self.vocab_transform(hidden_states) # (bs, seq_length, dim) + prediction_logits = gelu(prediction_logits) # (bs, seq_length, dim) + prediction_logits = self.vocab_layer_norm(prediction_logits) # (bs, seq_length, dim) + prediction_logits = self.vocab_projector(prediction_logits) # (bs, seq_length, vocab_size) + + outputs = (prediction_logits, ) + dlbrt_output[1:] + if masked_lm_labels is not None: + mlm_loss = self.mlm_loss_fct(prediction_logits.view(-1, prediction_logits.size(-1)), + masked_lm_labels.view(-1)) + outputs = (mlm_loss,) + outputs + + return outputs # (mlm_loss), prediction_logits, (all hidden_states), (all attentions) + + +@add_start_docstrings("""DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of + the pooled output) e.g. for GLUE tasks. """, + DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING) +class DistilBertForSequenceClassification(DistilBertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. + If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
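The classification example that follows covers `num_labels > 1`; below is a minimal sketch of the regression branch (`num_labels == 1`, which switches the head to MSE loss), assuming the vendored `pytorch_transformers` package is importable and the stock `distilbert-base-uncased` weights are available.

```python
import torch
from pytorch_transformers import DistilBertConfig, DistilBertTokenizer
from pytorch_transformers.modeling_distilbert import DistilBertForSequenceClassification

config = DistilBertConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 1  # a single output unit selects the regression (MSELoss) branch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # batch size 1
labels = torch.tensor([2.5])  # float target, e.g. a similarity score
loss, logits = model(input_ids, labels=labels)[:2]
```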
+ + Examples:: + + tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased') + model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(DistilBertForSequenceClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.distilbert = DistilBertModel(config) + self.pre_classifier = nn.Linear(config.dim, config.dim) + self.classifier = nn.Linear(config.dim, config.num_labels) + self.dropout = nn.Dropout(config.seq_classif_dropout) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, head_mask=None, labels=None): + distilbert_output = self.distilbert(input_ids=input_ids, + attention_mask=attention_mask, + head_mask=head_mask) + hidden_state = distilbert_output[0] # (bs, seq_len, dim) + pooled_output = hidden_state[:, 0] # (bs, dim) + pooled_output = self.pre_classifier(pooled_output) # (bs, dim) + pooled_output = nn.ReLU()(pooled_output) # (bs, dim) + pooled_output = self.dropout(pooled_output) # (bs, dim) + logits = self.classifier(pooled_output) # (bs, dim) + + outputs = (logits,) + distilbert_output[1:] + if labels is not None: + if self.num_labels == 1: + loss_fct = nn.MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = nn.CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (loss), logits, (hidden_states), (attentions) + + +@add_start_docstrings("""DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of + the hidden-states output to compute `span start logits` and `span end logits`). """, + DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING) +class DistilBertForQuestionAnswering(DistilBertPreTrainedModel): + r""" + **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. + **start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-start scores (before SoftMax). + **end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-end scores (before SoftMax). 
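For the sequence-classification head shown just above, the whole sentence is summarized by the hidden state of the first token, then passed through `pre_classifier`, ReLU, dropout and `classifier` (whose output is `(bs, num_labels)`, despite the `(bs, dim)` trailing comment on that line). A small sketch with toy sizes:

```
import torch
import torch.nn as nn

bs, seq_len, dim, num_labels = 2, 8, 16, 3
hidden_state = torch.randn(bs, seq_len, dim)       # stand-in for the DistilBert output

pooled = hidden_state[:, 0]                        # (bs, dim): hidden state of the first token
pooled = nn.Linear(dim, dim)(pooled)               # pre_classifier
pooled = nn.ReLU()(pooled)
pooled = nn.Dropout(0.2)(pooled)                   # seq_classif_dropout (toy value)
logits = nn.Linear(dim, num_labels)(pooled)        # classifier, (bs, num_labels)

labels = torch.tensor([0, 2])
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape, loss.item())
```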
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:3]
+
+    """
+    def __init__(self, config):
+        super(DistilBertForQuestionAnswering, self).__init__(config)
+
+        self.distilbert = DistilBertModel(config)
+        self.qa_outputs = nn.Linear(config.dim, config.num_labels)
+        assert config.num_labels == 2
+        self.dropout = nn.Dropout(config.qa_dropout)
+
+        self.init_weights()
+
+    def forward(self, input_ids, attention_mask=None, head_mask=None, start_positions=None, end_positions=None):
+        distilbert_output = self.distilbert(input_ids=input_ids,
+                                            attention_mask=attention_mask,
+                                            head_mask=head_mask)
+        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)
+
+        hidden_states = self.dropout(hidden_states)  # (bs, max_query_len, dim)
+        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1)  # (bs, max_query_len)
+        end_logits = end_logits.squeeze(-1)  # (bs, max_query_len)
+
+        outputs = (start_logits, end_logits,) + distilbert_output[1:]
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, squeeze the extra dimension added by the split
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # sometimes the start/end positions are outside our model inputs, we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions.clamp_(0, ignored_index)
+            end_positions.clamp_(0, ignored_index)
+
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+            outputs = (total_loss,) + outputs
+
+        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
diff --git a/Optimus/code/pytorch_transformers/modeling_gpt2.py b/Optimus/code/pytorch_transformers/modeling_gpt2.py
new file mode 100755
index 0000000000000000000000000000000000000000..6cd0526b8bbf728269a6fdacc39bfcb6c9be6376
--- /dev/null
+++ b/Optimus/code/pytorch_transformers/modeling_gpt2.py
@@ -0,0 +1,807 @@
+# coding=utf-8
+# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch OpenAI GPT-2 model.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import pdb + +import collections +import json +import logging +import math +import os +import sys +from io import open + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss +from torch.nn.parameter import Parameter + +from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary +from .configuration_gpt2 import GPT2Config +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin", + "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin", + "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin"} + +def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path): + """ Load tf checkpoints in a pytorch model + """ + try: + import re + import numpy as np + import tensorflow as tf + except ImportError: + logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + tf_path = os.path.abspath(gpt2_checkpoint_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array.squeeze()) + + for name, array in zip(names, arrays): + name = name[6:] # skip "model/" + name = name.split('/') + pointer = model + for m_name in name: + if re.fullmatch(r'[A-Za-z]+\d+', m_name): + l = re.split(r'(\d+)', m_name) + else: + l = [m_name] + if l[0] == 'w' or l[0] == 'g': + pointer = getattr(pointer, 'weight') + elif l[0] == 'b': + pointer = getattr(pointer, 'bias') + elif l[0] == 'wpe' or l[0] == 'wte': + pointer = getattr(pointer, l[0]) + pointer = getattr(pointer, 'weight') + else: + pointer = getattr(pointer, l[0]) + if len(l) >= 2: + num = int(l[1]) + pointer = pointer[num] + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +def gelu(x): + return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + + +class Attention(nn.Module): + def __init__(self, nx, n_ctx, config, scale=False): + super(Attention, self).__init__() + self.output_attentions = config.output_attentions + + n_state = nx # in Attention: n_state=768 (nx=n_embd) + # [switch nx => n_state from Block to Attention to keep identical to TF implem] + 
assert n_state % config.n_head == 0 + self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)) + self.n_head = config.n_head + self.split_size = n_state + self.scale = scale + + self.c_attn = Conv1D(n_state * 3, nx) + self.c_proj = Conv1D(n_state, nx) + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + mask = torch.ones(self.n_head, self.split_size // self.n_head) + heads = set(heads) - self.pruned_heads # Convert to set and emove already pruned heads + for head in heads: + # Compute how many pruned heads are before the head and move the index accordingly + head = head - sum(1 if h < head else 0 for h in self.pruned_heads) + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + index_attn = torch.cat([index, index + self.split_size, index + (2*self.split_size)]) + + # Prune conv1d layers + self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1) + self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0) + + # Update hyper params + self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads)) + self.n_head = self.n_head - len(heads) + self.pruned_heads = self.pruned_heads.union(heads) + + def _attn(self, q, k, v, attention_mask=None, head_mask=None): + w = torch.matmul(q, k) + if self.scale: + w = w / math.sqrt(v.size(-1)) + nd, ns = w.size(-2), w.size(-1) + b = self.bias[:, :, ns-nd:ns, :ns] + w = w * b - 1e4 * (1 - b) + + if attention_mask is not None: + # Apply the attention mask + w = w + attention_mask + + w = nn.Softmax(dim=-1)(w) + w = self.attn_dropout(w) + + # Mask heads if we want to + if head_mask is not None: + w = w * head_mask + + outputs = [torch.matmul(w, v)] + if self.output_attentions: + outputs.append(w) + return outputs + + def merge_heads(self, x): + x = x.permute(0, 2, 1, 3).contiguous() + new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),) + return x.view(*new_x_shape) # in Tensorflow implem: fct merge_states + + def split_heads(self, x, k=False): + new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head) + x = x.view(*new_x_shape) # in Tensorflow implem: fct split_states + if k: + return x.permute(0, 2, 3, 1) # (batch, head, head_features, seq_length) + else: + return x.permute(0, 2, 1, 3) # (batch, head, seq_length, head_features) + + def forward(self, x, layer_past=None, attention_mask=None, head_mask=None): + x = self.c_attn(x) + query, key, value = x.split(self.split_size, dim=2) + query = self.split_heads(query) + key = self.split_heads(key, k=True) + value = self.split_heads(value) + + + if layer_past is not None: + past_key, past_value = layer_past[0], layer_past[1] # transpose back cf below + + past_key = self.split_heads(past_key, k=True) + past_value = self.split_heads(past_value) + # pdb.set_trace() + key = torch.cat((past_key, key), dim=-1) + value = torch.cat((past_value, value), dim=-2) + present = torch.stack((key.transpose(-2, -1), value)) # transpose to have same shapes for stacking + + attn_outputs = self._attn(query, key, value, attention_mask, head_mask) + a = attn_outputs[0] + + a = self.merge_heads(a) + a = self.c_proj(a) + a = self.resid_dropout(a) + + outputs = [a, present] + attn_outputs[1:] + return outputs # a, present, (attentions) + + +class MLP(nn.Module): + def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd) + super(MLP, self).__init__() 
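The `_attn` method above enforces causality with the pre-computed lower-triangular `bias` buffer; the `ns - nd` offset is what keeps cached `past` keys (prepended in `forward`) visible to the new queries. A toy illustration of just that masking step, with made-up sizes:

```
import torch

n_ctx = 6
bias = torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)   # same buffer as above

# pretend 2 new query positions attend over 5 key positions in total
# (3 cached past tokens + the 2 current ones), i.e. nd=2 and ns=5 in _attn's terms
nd, ns = 2, 5
w = torch.randn(1, 1, nd, ns)                # raw attention scores

b = bias[:, :, ns - nd:ns, :ns]              # triangular block aligned to the end of the keys
w = w * b - 1e4 * (1 - b)                    # future positions get a large negative score
probs = torch.softmax(w, dim=-1)
print(b[0, 0])                               # query i may see keys 0 .. (ns - nd + i)
print(probs[0, 0].sum(-1))                   # each row still sums to 1
```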
+ nx = config.n_embd + self.c_fc = Conv1D(n_state, nx) + self.c_proj = Conv1D(nx, n_state) + self.act = gelu + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, x): + h = self.act(self.c_fc(x)) + h2 = self.c_proj(h) + return self.dropout(h2) + + +class Block(nn.Module): + def __init__(self, n_ctx, config, scale=False): + super(Block, self).__init__() + nx = config.n_embd + self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon) + self.attn = Attention(nx, n_ctx, config, scale) + self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon) + self.mlp = MLP(4 * nx, config) + + def forward(self, x, layer_past=None, attention_mask=None, head_mask=None): + output_attn = self.attn(self.ln_1(x), + layer_past=layer_past, + attention_mask=attention_mask, + head_mask=head_mask) + a = output_attn[0] # output_attn: a, present, (attentions) + + x = x + a + m = self.mlp(self.ln_2(x)) + x = x + m + + outputs = [x] + output_attn[1:] + return outputs # x, present, (attentions) + + +class GPT2PreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = GPT2Config + pretrained_model_archive_map = GPT2_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = load_tf_weights_in_gpt2 + base_model_prefix = "transformer" + + def __init__(self, *inputs, **kwargs): + super(GPT2PreTrainedModel, self).__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """ Initialize the weights. + """ + if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +GPT2_START_DOCSTRING = r""" OpenAI GPT-2 model was proposed in + `Language Models are Unsupervised Multitask Learners`_ + by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. + It's a causal (unidirectional) transformer pre-trained using language modeling on a very large + corpus of ~40 GB of text data. + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`Language Models are Unsupervised Multitask Learners`: + https://openai.com/blog/better-language-models/ + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.GPT2Config`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +GPT2_INPUTS_DOCSTRING = r""" Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on + the right rather than the left. + Indices can be obtained using :class:`pytorch_transformers.GPT2Tokenizer`. 
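The `Block` above is a pre-LayerNorm residual block: each sub-layer normalizes its input first and then adds its output back onto the residual stream. The sketch below only reproduces that ordering, with plain linear layers standing in for the real `Attention` and `MLP` modules:

```
import torch
import torch.nn as nn

nx = 16
x = torch.randn(2, 5, nx)

ln_1, ln_2 = nn.LayerNorm(nx), nn.LayerNorm(nx)
attn = nn.Linear(nx, nx)                                   # stand-in for Attention
mlp = nn.Sequential(nn.Linear(nx, 4 * nx), nn.GELU(),
                    nn.Linear(4 * nx, nx))                 # stand-in for MLP

x = x + attn(ln_1(x))      # normalize, attend, add residual
x = x + mlp(ln_2(x))       # normalize, feed forward, add residual
print(x.shape)
```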
+ See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **past**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + (see `past` output below). Can be used to speed up sequential decoding. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens (can be used to indicate various portions of the inputs). + The embeddings from these tokens will be summed with the respective token embeddings. + Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices). + **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, config.max_position_embeddings - 1]``. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.", + GPT2_START_DOCSTRING, GPT2_INPUTS_DOCSTRING) +class GPT2Model(GPT2PreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the last layer of the model. + **past**: + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + that contains pre-computed hidden-states (key and values in the attention blocks). + Can be used (see `past` input) to speed up sequential decoding. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
+ + Examples:: + + tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + model = GPT2Model.from_pretrained('gpt2') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(GPT2Model, self).__init__(config) + self.output_hidden_states = config.output_hidden_states + self.output_attentions = config.output_attentions + + self.wte = nn.Embedding(config.vocab_size, config.n_embd) + self.wpe = nn.Embedding(config.n_positions, config.n_embd) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)]) + self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon) + + try: + self.latent_size = config.latent_size + except: + self.latent_size = 32 # default size is 32 + + self.linear = nn.Linear(self.latent_size, config.hidden_size * config.n_layer, bias=False) # different latent vector for each layer + self.linear_emb = nn.Linear(self.latent_size, config.hidden_size, bias=False) # share the same latent vector as the embeddings + + self.config = config + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + self.wte = self._get_resized_embeddings(self.wte, new_num_tokens) + return self.wte + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + """ + for layer, heads in heads_to_prune.items(): + self.h[layer].attn.prune_heads(heads) + + def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, latent_as_gpt_emb=False, latent_as_gpt_memory=True): + + if past is None: + past_length = 0 + past = [None] * len(self.h) + else: + + + if latent_as_gpt_emb: + past_emb = self.linear_emb(past) # used as embeddings to add on other three embeddings + + if latent_as_gpt_memory: + past = self.linear(past) + share_latent = False + if share_latent: + # the same latent vector shared by all layers + past = [past.unsqueeze(-2), past.unsqueeze(-2)] # query, key + past = [past] * len(self.h) + past_length = past[0][0].size(-2) + else: + # different latent vectors for each layer + past_split = torch.split(past.unsqueeze(1), self.config.hidden_size, dim=2) + past = list(zip(past_split,past_split)) + + # past = past.view(batch_size,len(self.h),-1) + # past = [[past[:,i,:].unsqueeze(-2), past[:,i,:].unsqueeze(-2) ] for i in range(len(self.h))] + past_length = 1 # past[0][0].size(-2) + else: + past_length = 0 + past = [None] * len(self.h) + + + if position_ids is None: + position_ids = torch.arange(past_length, input_ids.size(-1) + past_length, dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + + + # Attention mask. + if attention_mask is not None: + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. 
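The `past` branch above is the Optimus-specific part of `GPT2Model`: when `latent_as_gpt_memory` is set, the latent code is projected to one hidden vector per layer and reused as a single pre-computed key/value pair, so every self-attention layer can attend to it like an extra cached token. A shape-level sketch of that bookkeeping, with a toy batch and standard GPT-2 sizes:

```
import torch
import torch.nn as nn

batch, latent_size, hidden_size, n_layer = 2, 32, 768, 12
z = torch.randn(batch, latent_size)                     # latent code from the encoder

# self.linear above: one projection yields a separate hidden vector per layer
linear = nn.Linear(latent_size, hidden_size * n_layer, bias=False)
past = linear(z)                                        # (batch, hidden_size * n_layer)

# split into n_layer chunks; each chunk serves as both "key" and "value"
past_split = torch.split(past.unsqueeze(1), hidden_size, dim=2)
layer_past = list(zip(past_split, past_split))
print(len(layer_past), layer_past[0][0].shape)          # 12 torch.Size([2, 1, 768])

# past_length is then set to 1: the latent behaves as one extra memory token
# that every real token may attend to in every layer.
```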
+ attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * -10000.0 + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # head_mask has shape n_layer x batch x n_heads x N x N + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.n_layer + + + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_ids.size(-1)) + position_ids = position_ids.view(-1, position_ids.size(-1)) + + + inputs_embeds = self.wte(input_ids) + position_embeds = self.wpe(position_ids) + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) + token_type_embeds = self.wte(token_type_ids) + else: + token_type_embeds = 0 + + + hidden_states = inputs_embeds + position_embeds + token_type_embeds + if latent_as_gpt_emb: + # pdb.set_trace() + hidden_states = hidden_states + past_emb.unsqueeze(1) + + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + presents = () + all_attentions = [] + all_hidden_states = () + for i, (block, layer_past) in enumerate(zip(self.h, past)): + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),) + + + outputs = block(hidden_states, + layer_past=layer_past, + attention_mask=attention_mask, + head_mask=head_mask[i]) + + + hidden_states, present = outputs[:2] + presents = presents + (present,) + + if self.output_attentions: + all_attentions.append(outputs[2]) + + hidden_states = self.ln_f(hidden_states) + + hidden_states = hidden_states.view(*output_shape) + # Add last hidden state + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + outputs = (hidden_states, presents) + if self.output_hidden_states: + outputs = outputs + (all_hidden_states,) + if self.output_attentions: + # let the number of heads free (-1) so we can extract attention even after head pruning + attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:] + all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions) + outputs = outputs + (all_attentions,) + return outputs # last hidden state, presents, (all hidden_states), (attentions) + + +@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top +(linear layer with weights tied to the input embeddings). 
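The attention-mask conversion at the top of this `forward` turns a 0/1 padding mask into an additive bias, so masked positions receive a score of about -10000 before the softmax. A small, self-contained illustration with toy sequences (not the model's real inputs):

```
import torch

# a padding mask for 2 sequences of length 5 (1 = real token, 0 = padding)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]], dtype=torch.float)

# same transformation as in GPT2Model.forward above: broadcastable shape
# (batch, 1, 1, seq), then 0 -> -10000 so adding it to the raw scores
# effectively removes padded positions from the softmax
mask = attention_mask.unsqueeze(1).unsqueeze(2)
mask = (1.0 - mask) * -10000.0

scores = torch.zeros(2, 1, 5, 5)          # pretend raw attention scores
probs = torch.softmax(scores + mask, dim=-1)
print(probs[0, 0, 0])                     # ~1/3 on the three real tokens, ~0 on padding
```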
""", GPT2_START_DOCSTRING, GPT2_INPUTS_DOCSTRING) +class GPT2LMHeadModel(GPT2PreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **past**: + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + that contains pre-computed hidden-states (key and values in the attention blocks). + Can be used (see `past` input) to speed up sequential decoding. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + import torch + from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel + + tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + model = GPT2LMHeadModel.from_pretrained('gpt2') + + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=input_ids) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(GPT2LMHeadModel, self).__init__(config) + self.transformer = GPT2Model(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) + + self.init_weights() + self.tie_weights() + + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. 
+ """ + self._tie_or_clone_weights(self.lm_head, + self.transformer.wte) + + def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + labels=None, label_ignore=None): + transformer_outputs = self.transformer(input_ids, + past=past, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + hidden_states = transformer_outputs[0] + + lm_logits = self.lm_head(hidden_states) + + outputs = (lm_logits,) + transformer_outputs[1:] + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss(ignore_index=label_ignore, reduce=False) # 50258 is the padding id, otherwise -1 is used for masked LM. + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), + shift_labels.view(-1)) + loss = torch.sum(loss.view(-1, shift_labels.shape[-1]), -1) + outputs = (loss,) + outputs + + + return outputs # (loss), lm_logits, presents, (all hidden_states), (attentions) + + + +@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top +(linear layer with weights tied to the input embeddings). """, GPT2_START_DOCSTRING, GPT2_INPUTS_DOCSTRING) +class GPT2ForLatentConnector(GPT2PreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **past**: + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + that contains pre-computed hidden-states (key and values in the attention blocks). + Can be used (see `past` input) to speed up sequential decoding. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
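The loss in `GPT2LMHeadModel.forward` above shifts the labels by one position and, unlike the stock GPT-2 head, keeps a separate summed loss per sentence; that per-sentence reconstruction term is what the VAE objective needs. A sketch of the same computation using `reduction='none'` (the current spelling of the deprecated `reduce=False`) and a hypothetical padding id standing in for `label_ignore`:

```
import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 2, 5, 11
pad_id = 10                                   # hypothetical padding id passed as label_ignore
                                              # (the comment above mentions 50258 for the extended GPT-2 vocab)
lm_logits = torch.randn(batch, seq_len, vocab)
labels = torch.tensor([[3, 4, 5, pad_id, pad_id],
                       [6, 7, 8, 9, 2]])

# same shift-by-one as above: position t predicts token t+1
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

loss_fct = CrossEntropyLoss(ignore_index=pad_id, reduction='none')
per_token = loss_fct(shift_logits.view(-1, vocab), shift_labels.view(-1))

# summing over the token dimension gives one reconstruction loss per sentence
per_sentence = per_token.view(batch, -1).sum(-1)
print(per_sentence.shape)                     # torch.Size([2])
```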
+ + Examples:: + + import torch + from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel + + tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + model = GPT2LMHeadModel.from_pretrained('gpt2') + + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=input_ids) + loss, logits = outputs[:2] + + """ + def __init__(self, config, latent_size=32, latent_as_gpt_emb=True, latent_as_gpt_memory=True): + + super(GPT2ForLatentConnector, self).__init__(config) + + + self.transformer = GPT2Model(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) + + self.init_weights() + self.tie_weights() + + self.latent_as_gpt_emb = latent_as_gpt_emb + self.latent_as_gpt_memory = latent_as_gpt_memory + + + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. + """ + self._tie_or_clone_weights(self.lm_head, + self.transformer.wte) + + def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + labels=None, label_ignore=None): + + + transformer_outputs = self.transformer(input_ids, + past=past, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + latent_as_gpt_emb=self.latent_as_gpt_emb, + latent_as_gpt_memory=self.latent_as_gpt_memory) + hidden_states = transformer_outputs[0] + + lm_logits = self.lm_head(hidden_states) + + outputs = (lm_logits,) + transformer_outputs[1:] + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss(ignore_index=label_ignore, reduce=False) # 50258 is the padding id, otherwise -1 is used for masked LM. + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), + shift_labels.view(-1)) + loss = torch.sum(loss.view(-1, shift_labels.shape[-1]), -1) + outputs = (loss,) + outputs + + + return outputs # (loss), lm_logits, presents, (all hidden_states), (attentions) + +@add_start_docstrings("""The GPT2 Model transformer with a language modeling and a multiple-choice classification +head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers. +The language modeling head has its weights tied to the input embeddings, +the classification head takes as input the input of a specified classification token index in the input sequence). +""", GPT2_START_DOCSTRING, GPT2_INPUTS_DOCSTRING) +class GPT2DoubleHeadsModel(GPT2PreTrainedModel): + r""" + **mc_token_ids**: (`optional`, default to index of the last token of the input) ``torch.LongTensor`` of shape ``(batch_size, num_choices)``: + Index of the classification token in each input sequence. + Selected in the range ``[0, input_ids.size(-1) - 1[``. + **lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + **mc_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size)``: + Labels for computing the multiple choice classification loss. 
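`GPT2ForLatentConnector` enables both injection routes by default (`latent_as_gpt_emb=True`, `latent_as_gpt_memory=True`). The memory route is sketched earlier; the embedding route in `GPT2Model.forward` simply projects the latent once and adds it to the embedding of every position, as in this toy sketch:

```
import torch
import torch.nn as nn

batch, seq_len, latent_size, hidden_size = 2, 5, 32, 768
z = torch.randn(batch, latent_size)
token_plus_position_embeds = torch.randn(batch, seq_len, hidden_size)   # stand-in

# self.linear_emb above: the latent conditions every position of the sequence
linear_emb = nn.Linear(latent_size, hidden_size, bias=False)
past_emb = linear_emb(z)                                              # (batch, hidden_size)
hidden_states = token_plus_position_embeds + past_emb.unsqueeze(1)    # broadcast over seq_len
print(hidden_states.shape)                                            # torch.Size([2, 5, 768])
```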
+ Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above) + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **lm_loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **mc_loss**: (`optional`, returned when ``multiple_choice_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Multiple choice classification loss. + **lm_prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **mc_prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` + Prediction scores of the multiplechoice classification head (scores for each choice before SoftMax). + **past**: + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + that contains pre-computed hidden-states (key and values in the attention blocks). + Can be used (see `past` input) to speed up sequential decoding. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + import torch + from pytorch_transformers import GPT2Tokenizer, GPT2DoubleHeadsModel + + tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + model = GPT2DoubleHeadsModel.from_pretrained('gpt2') + + # Add a [CLS] to the vocabulary (we should train it also!) + tokenizer.add_special_tokens({'cls_token': '[CLS]'}) + model.resize_token_embeddings(len(tokenizer)) # Update the model embeddings with the new vocabulary size + print(tokenizer.cls_token_id, len(tokenizer)) # The newly token the last token of the vocabulary + + choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"] + encoded_choices = [tokenizer.encode(s) for s in choices] + cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices] + + input_ids = torch.tensor(encoded_choices).unsqueeze(0) # Batch size: 1, number of choices: 2 + mc_token_ids = torch.tensor([cls_token_location]) # Batch size: 1 + + outputs = model(input_ids, mc_token_ids=mc_token_ids) + lm_prediction_scores, mc_prediction_scores = outputs[:2] + + """ + def __init__(self, config): + super(GPT2DoubleHeadsModel, self).__init__(config) + self.transformer = GPT2Model(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) + self.multiple_choice_head = SequenceSummary(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. 
+ """ + self._tie_or_clone_weights(self.lm_head, + self.transformer.wte) + + def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + mc_token_ids=None, lm_labels=None, mc_labels=None): + transformer_outputs = self.transformer(input_ids, + past=past, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + hidden_states = transformer_outputs[0] + + lm_logits = self.lm_head(hidden_states) + mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1) + + outputs = (lm_logits, mc_logits) + transformer_outputs[1:] + if mc_labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), + mc_labels.view(-1)) + outputs = (loss,) + outputs + if lm_labels is not None: + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = lm_labels[..., 1:].contiguous() + loss_fct = CrossEntropyLoss(ignore_index=-1) + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), + shift_labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (lm loss), (mc loss), lm logits, mc logits, presents, (all hidden_states), (attentions) diff --git a/Optimus/code/pytorch_transformers/modeling_openai.py b/Optimus/code/pytorch_transformers/modeling_openai.py new file mode 100755 index 0000000000000000000000000000000000000000..4b02baf2f4b4f1cda96ac3499808fb169e0f3993 --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_openai.py @@ -0,0 +1,621 @@ +# coding=utf-8 +# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
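In `GPT2DoubleHeadsModel.forward` above, `multiple_choice_head` (a `SequenceSummary`) reduces each choice to the hidden state at its classification-token position given by `mc_token_ids`. The gather below is only an illustration of that reduction with toy sizes; the real utility may additionally apply dropout and an activation depending on the config:

```
import torch

batch, num_choices, seq_len, hidden = 2, 2, 7, 16
hidden_states = torch.randn(batch, num_choices, seq_len, hidden)   # transformer output
mc_token_ids = torch.tensor([[5, 6],
                             [4, 6]])                              # position of [CLS] in each choice

# pick, for every choice, the hidden state at the classification-token position
index = mc_token_ids.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 1, hidden)  # (batch, choices, 1, hidden)
cls_states = hidden_states.gather(-2, index).squeeze(-2)                    # (batch, choices, hidden)
print(cls_states.shape)

# a linear layer on top of cls_states then yields mc_logits of shape (batch, num_choices)
```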
+"""PyTorch OpenAI GPT model.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import json +import logging +import math +import os +import sys +from io import open + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss +from torch.nn.parameter import Parameter + +from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary +from .configuration_openai import OpenAIGPTConfig +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin"} + + +def load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path): + """ Load tf pre-trained weights in a pytorch model (from NumPy arrays here) + """ + import re + import numpy as np + + if '.ckpt' in openai_checkpoint_folder_path: + openai_checkpoint_folder_path = os.path.dirname(openai_checkpoint_folder_path) + + logger.info("Loading weights from {}".format(openai_checkpoint_folder_path)) + + names = json.load(open(openai_checkpoint_folder_path + '/parameters_names.json', "r", encoding='utf-8')) + shapes = json.load(open(openai_checkpoint_folder_path + '/params_shapes.json', "r", encoding='utf-8')) + offsets = np.cumsum([np.prod(shape) for shape in shapes]) + init_params = [np.load(openai_checkpoint_folder_path + '/params_{}.npy'.format(n)) for n in range(10)] + init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1] + init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)] + + # This was used when we had a single embedding matrix for positions and tokens + # init_params[0] = np.concatenate([init_params[1], init_params[0]], 0) + # del init_params[1] + init_params = [arr.squeeze() for arr in init_params] + + try: + assert model.tokens_embed.weight.shape == init_params[1].shape + assert model.positions_embed.weight.shape == init_params[0].shape + except AssertionError as e: + e.args += (model.tokens_embed.weight.shape, init_params[1].shape) + e.args += (model.positions_embed.weight.shape, init_params[0].shape) + raise + + model.tokens_embed.weight.data = torch.from_numpy(init_params[1]) + model.positions_embed.weight.data = torch.from_numpy(init_params[0]) + names.pop(0) + # Pop position and token embedding arrays + init_params.pop(0) + init_params.pop(0) + + for name, array in zip(names, init_params): # names[1:n_transfer], init_params[1:n_transfer]): + name = name[6:] # skip "model/" + assert name[-2:] == ":0" + name = name[:-2] + name = name.split('/') + pointer = model + for m_name in name: + if re.fullmatch(r'[A-Za-z]+\d+', m_name): + l = re.split(r'(\d+)', m_name) + else: + l = [m_name] + if l[0] == 'g': + pointer = getattr(pointer, 'weight') + elif l[0] == 'b': + pointer = getattr(pointer, 'bias') + elif l[0] == 'w': + pointer = getattr(pointer, 'weight') + else: + pointer = getattr(pointer, l[0]) + if len(l) >= 2: + num = int(l[1]) + pointer = pointer[num] + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + return model + + +def gelu(x): + return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 
3)))) + + +def swish(x): + return x * torch.sigmoid(x) + + +ACT_FNS = {"relu": nn.ReLU, "swish": swish, "gelu": gelu} + + +class Attention(nn.Module): + def __init__(self, nx, n_ctx, config, scale=False): + super(Attention, self).__init__() + n_state = nx # in Attention: n_state=768 (nx=n_embd) + # [switch nx => n_state from Block to Attention to keep identical to TF implem] + assert n_state % config.n_head == 0 + self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)) + self.n_head = config.n_head + self.split_size = n_state + self.scale = scale + + self.output_attentions = config.output_attentions + + self.c_attn = Conv1D(n_state * 3, nx) + self.c_proj = Conv1D(n_state, nx) + self.attn_dropout = nn.Dropout(config.attn_pdrop) + self.resid_dropout = nn.Dropout(config.resid_pdrop) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + mask = torch.ones(self.n_head, self.split_size // self.n_head) + heads = set(heads) - self.pruned_heads + for head in heads: + head -= sum(1 if h < head else 0 for h in self.pruned_heads) + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + index_attn = torch.cat([index, index + self.split_size, index + (2*self.split_size)]) + # Prune conv1d layers + self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1) + self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0) + # Update hyper params + self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads)) + self.n_head = self.n_head - len(heads) + self.pruned_heads = self.pruned_heads.union(heads) + + def _attn(self, q, k, v, attention_mask=None, head_mask=None): + w = torch.matmul(q, k) + if self.scale: + w = w / math.sqrt(v.size(-1)) + # w = w * self.bias + -1e9 * (1 - self.bias) # TF implem method: mask_attn_weights + # XD: self.b may be larger than w, so we need to crop it + b = self.bias[:, :, : w.size(-2), : w.size(-1)] + w = w * b + -1e9 * (1 - b) + + if attention_mask is not None: + # Apply the attention mask + w = w + attention_mask + + w = nn.Softmax(dim=-1)(w) + w = self.attn_dropout(w) + + # Mask heads if we want to + if head_mask is not None: + w = w * head_mask + + outputs = [torch.matmul(w, v)] + if self.output_attentions: + outputs.append(w) + return outputs + + def merge_heads(self, x): + x = x.permute(0, 2, 1, 3).contiguous() + new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),) + return x.view(*new_x_shape) # in Tensorflow implem: fct merge_states + + def split_heads(self, x, k=False): + new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head) + x = x.view(*new_x_shape) # in Tensorflow implem: fct split_states + if k: + return x.permute(0, 2, 3, 1) + else: + return x.permute(0, 2, 1, 3) + + def forward(self, x, attention_mask=None, head_mask=None): + x = self.c_attn(x) + query, key, value = x.split(self.split_size, dim=2) + query = self.split_heads(query) + key = self.split_heads(key, k=True) + value = self.split_heads(value) + + attn_outputs = self._attn(query, key, value, attention_mask, head_mask) + a = attn_outputs[0] + + a = self.merge_heads(a) + a = self.c_proj(a) + a = self.resid_dropout(a) + + outputs = [a] + attn_outputs[1:] + return outputs # a, (attentions) + + +class MLP(nn.Module): + def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd) + super(MLP, self).__init__() + nx = config.n_embd + self.c_fc = Conv1D(n_state, nx) + self.c_proj = Conv1D(nx, n_state) + self.act = ACT_FNS[config.afn] + 
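This file defines its own `gelu` (the tanh approximation) and `swish` activations and exposes them through `ACT_FNS`. A quick numerical check that the approximation tracks the exact, erf-based GELU:

```
import math
import torch

def gelu_tanh(x):
    # the tanh approximation defined above
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))

def gelu_exact(x):
    # exact GELU via the Gaussian CDF
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * torch.sigmoid(x)

x = torch.linspace(-3, 3, steps=7)
print(torch.max(torch.abs(gelu_tanh(x) - gelu_exact(x))))   # agreement to within ~1e-3
print(swish(x))
```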
self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, x): + h = self.act(self.c_fc(x)) + h2 = self.c_proj(h) + return self.dropout(h2) + + +class Block(nn.Module): + def __init__(self, n_ctx, config, scale=False): + super(Block, self).__init__() + nx = config.n_embd + self.attn = Attention(nx, n_ctx, config, scale) + self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon) + self.mlp = MLP(4 * nx, config) + self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon) + + def forward(self, x, attention_mask=None, head_mask=None): + attn_outputs = self.attn(x, attention_mask=attention_mask, head_mask=head_mask) + a = attn_outputs[0] + + n = self.ln_1(x + a) + m = self.mlp(n) + h = self.ln_2(n + m) + + outputs = [h] + attn_outputs[1:] + return outputs + + +class OpenAIGPTPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = OpenAIGPTConfig + pretrained_model_archive_map = OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = load_tf_weights_in_openai_gpt + base_model_prefix = "transformer" + + def _init_weights(self, module): + """ Initialize the weights. + """ + if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +OPENAI_GPT_START_DOCSTRING = r""" OpenAI GPT model was proposed in + `Improving Language Understanding by Generative Pre-Training`_ + by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. + It's a causal (unidirectional) transformer pre-trained using language modeling on a large + corpus will long range dependencies, the Toronto Book Corpus. + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`Improving Language Understanding by Generative Pre-Training`: + https://openai.com/blog/language-unsupervised/ + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.OpenAIGPTConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +OPENAI_GPT_INPUTS_DOCSTRING = r""" Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on + the right rather than the left. + Indices can be obtained using :class:`pytorch_transformers.BPT2Tokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. 
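Unlike the pre-LayerNorm `Block` in the GPT-2 file earlier in this patch, the OpenAI GPT `Block` above is post-LayerNorm: the residual sum is normalized after each sub-layer. A minimal sketch of that ordering, again with linear stand-ins for the real sub-modules:

```
import torch
import torch.nn as nn

nx = 16
x = torch.randn(2, 5, nx)
attn = nn.Linear(nx, nx)                        # stand-in for Attention
mlp = nn.Linear(nx, nx)                         # stand-in for MLP
ln_1, ln_2 = nn.LayerNorm(nx), nn.LayerNorm(nx)

n = ln_1(x + attn(x))      # attend, add residual, then normalize
h = ln_2(n + mlp(n))       # feed forward, add residual, then normalize
print(h.shape)
```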
+ Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens (can be used to indicate various portions of the inputs). + The embeddings from these tokens will be summed with the respective token embeddings. + Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices) + **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, config.max_position_embeddings - 1]``. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare OpenAI GPT transformer model outputting raw hidden-states without any specific head on top.", + OPENAI_GPT_START_DOCSTRING, OPENAI_GPT_INPUTS_DOCSTRING) +class OpenAIGPTModel(OpenAIGPTPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the last layer of the model. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') + model = OpenAIGPTModel.from_pretrained('openai-gpt') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(OpenAIGPTModel, self).__init__(config) + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd) + self.positions_embed = nn.Embedding(config.n_positions, config.n_embd) + self.drop = nn.Dropout(config.embd_pdrop) + self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)]) + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + self.tokens_embed = self._get_resized_embeddings(self.tokens_embed, new_num_tokens) + return self.tokens_embed + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. 
+ heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + """ + for layer, heads in heads_to_prune.items(): + self.h[layer].attn.prune_heads(heads) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None): + if position_ids is None: + # This was used when we had a single embedding matrice from position and token embeddings + # start = self.config.vocab_size + self.config.n_special + # end = start + input_ids.size(-1) + # position_ids = torch.arange(start, end, dtype=torch.long, device=input_ids.device) + position_ids = torch.arange(input_ids.size(-1), dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + + # Attention mask. + if attention_mask is not None: + # We create a 3D attention mask from a 2D tensor mask. + # Sizes are [batch_size, 1, 1, to_seq_length] + # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length] + # this attention mask is more simple than the triangular masking of causal attention + # used in OpenAI GPT, we just need to prepare the broadcast dimension here. + attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility + attention_mask = (1.0 - attention_mask) * -10000.0 + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # head_mask has shape n_layer x batch x n_heads x N x N + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.config.n_layer + + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_ids.size(-1)) + position_ids = position_ids.view(-1, position_ids.size(-1)) + + inputs_embeds = self.tokens_embed(input_ids) + position_embeds = self.positions_embed(position_ids) + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) + token_type_embeds = self.tokens_embed(token_type_ids) + else: + token_type_embeds = 0 + hidden_states = inputs_embeds + position_embeds + token_type_embeds + hidden_states = self.drop(hidden_states) + + output_shape = input_shape + (hidden_states.size(-1),) + + all_attentions = () + all_hidden_states = () + for i, block in enumerate(self.h): + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),) + + outputs = block(hidden_states, attention_mask, head_mask[i]) + hidden_states = outputs[0] + if self.output_attentions: + all_attentions = all_attentions + (outputs[1],) + + # Add last layer + if self.output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),) + + outputs = 
(hidden_states.view(*output_shape),) + if self.output_hidden_states: + outputs = outputs + (all_hidden_states,) + if self.output_attentions: + outputs = outputs + (all_attentions,) + return outputs # last hidden state, (all hidden states), (all attentions) + + +@add_start_docstrings("""OpenAI GPT Model transformer with a language modeling head on top +(linear layer with weights tied to the input embeddings). """, OPENAI_GPT_START_DOCSTRING, OPENAI_GPT_INPUTS_DOCSTRING) +class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') + model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=input_ids) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(OpenAIGPTLMHeadModel, self).__init__(config) + self.transformer = OpenAIGPTModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. 
+ """ + self._tie_or_clone_weights(self.lm_head, + self.transformer.tokens_embed) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + hidden_states = transformer_outputs[0] + lm_logits = self.lm_head(hidden_states) + + outputs = (lm_logits,) + transformer_outputs[1:] + if labels is not None: + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss(ignore_index=-1) + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), + shift_labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (loss), lm_logits, (all hidden states), (all attentions) + + +@add_start_docstrings("""OpenAI GPT Model transformer with a language modeling and a multiple-choice classification +head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers. +The language modeling head has its weights tied to the input embeddings, +the classification head takes as input the input of a specified classification token index in the input sequence). +""", OPENAI_GPT_START_DOCSTRING, OPENAI_GPT_INPUTS_DOCSTRING) +class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel): + r""" + **mc_token_ids**: (`optional`, default to index of the last token of the input) ``torch.LongTensor`` of shape ``(batch_size, num_choices)``: + Index of the classification token in each input sequence. + Selected in the range ``[0, input_ids.size(-1) - 1[``. + **lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + **mc_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size)``: + Labels for computing the multiple choice classification loss. + Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above) + + `multiple_choice_labels`: optional multiple choice labels: ``torch.LongTensor`` of shape [batch_size] + with indices selected in [0, ..., num_choices]. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **lm_loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **mc_loss**: (`optional`, returned when ``multiple_choice_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Multiple choice classification loss. + **lm_prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **mc_prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` + Prediction scores of the multiplechoice classification head (scores for each choice before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') + model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt') + tokenizer.add_special_tokens({'cls_token': '[CLS]'}) # Add a [CLS] to the vocabulary (we should train it also!) + choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"] + input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices + mc_token_ids = torch.tensor([input_ids.size(-1), input_ids.size(-1)]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, mc_token_ids=mc_token_ids) + lm_prediction_scores, mc_prediction_scores = outputs[:2] + + """ + def __init__(self, config): + super(OpenAIGPTDoubleHeadsModel, self).__init__(config) + + self.transformer = OpenAIGPTModel(config) + self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) + self.multiple_choice_head = SequenceSummary(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. + """ + self._tie_or_clone_weights(self.lm_head, + self.transformer.tokens_embed) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + mc_token_ids=None, lm_labels=None, mc_labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + hidden_states = transformer_outputs[0] + + lm_logits = self.lm_head(hidden_states) + mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1) + + outputs = (lm_logits, mc_logits) + transformer_outputs[1:] + if mc_labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), + mc_labels.view(-1)) + outputs = (loss,) + outputs + if lm_labels is not None: + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = lm_labels[..., 1:].contiguous() + loss_fct = CrossEntropyLoss(ignore_index=-1) + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), + shift_labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (lm loss), (mc loss), lm logits, mc logits, (all hidden_states), (attentions) diff --git a/Optimus/code/pytorch_transformers/modeling_roberta.py b/Optimus/code/pytorch_transformers/modeling_roberta.py new file mode 100755 index 0000000000000000000000000000000000000000..1cc4147f436dfbcf881cee6d2ce80d7453d1832e --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_roberta.py @@ -0,0 +1,472 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch RoBERTa model. """ + +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import pdb + +import logging + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss, MSELoss + +from .modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu +from .configuration_roberta import RobertaConfig +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin", + 'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin", + 'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-pytorch_model.bin", +} + +class RobertaEmbeddings(BertEmbeddings): + """ + Same as BertEmbeddings with a tiny tweak for positional embeddings indexing. + """ + def __init__(self, config): + super(RobertaEmbeddings, self).__init__(config) + self.padding_idx = 1 + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + seq_length = input_ids.size(1) + if position_ids is None: + # Position numbers begin at padding_idx+1. Padding symbols are ignored. + # cf. fairseq's `utils.make_positions` + position_ids = torch.arange(self.padding_idx+1, seq_length+self.padding_idx+1, dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + return super(RobertaEmbeddings, self).forward(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids) + + +ROBERTA_START_DOCSTRING = r""" The RoBERTa model was proposed in + `RoBERTa: A Robustly Optimized BERT Pretraining Approach`_ + by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, + Veselin Stoyanov. It is based on Google's BERT model released in 2018. + + It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining + objective and training with much larger mini-batches and learning rates. + + This implementation is the same as BertModel with a tiny embeddings tweak as well as a setup for Roberta pretrained + models. + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`RoBERTa: A Robustly Optimized BERT Pretraining Approach`: + https://arxiv.org/abs/1907.11692 + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.RobertaConfig`): Model configuration class with all the parameters of the + model. Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. 
+""" + +ROBERTA_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + To match pre-training, RoBERTa input sequence should be formatted with and tokens as follows: + + (a) For sequence pairs: + + ``tokens: Is this Jacksonville ? No it is not . `` + + (b) For single sequences: + + ``tokens: the dog is hairy . `` + + Fully encoded sequences or sequence pairs can be obtained using the RobertaTokenizer.encode function with + the ``add_special_tokens`` parameter set to ``True``. + + RoBERTa is a model with absolute position embeddings so it's usually advised to pad the inputs on + the right rather than the left. + + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **token_type_ids**: (`optional` need to be trained) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Optional segment token indices to indicate first and second portions of the inputs. + This embedding matrice is not trained (not pretrained during RoBERTa pretraining), you will have to train it + during finetuning. + Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` + corresponds to a `sentence B` token + (see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details). + **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, config.max_position_embeddings - 1[``. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.", + ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING) +class RobertaModel(BertModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the output of the last layer of the model. + **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)`` + Last layer hidden-state of the first token of the sequence (classification token) + further processed by a Linear layer and a Tanh activation function. The Linear + layer weights are trained from the next sentence prediction (classification) + objective during Bert pretraining. This output is usually *not* a good summary + of the semantic content of the input, you're often better with averaging or pooling + the sequence of hidden-states for the whole input sequence. 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = RobertaTokenizer.from_pretrained('roberta-base') + model = RobertaModel.from_pretrained('roberta-base') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + config_class = RobertaConfig + pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP + base_model_prefix = "roberta" + + def __init__(self, config): + super(RobertaModel, self).__init__(config) + + self.embeddings = RobertaEmbeddings(config) + self.init_weights() + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None): + if input_ids[:, 0].sum().item() != 0: + logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. " + "This model requires special tokens in order to work. " + "Please specify add_special_tokens=True in your encoding.") + return super(RobertaModel, self).forward(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + +@add_start_docstrings("""RoBERTa Model with a `language modeling` head on top. """, + ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING) +class RobertaForMaskedLM(BertPreTrainedModel): + r""" + **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for computing the masked language modeling loss. + Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) + Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels + in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Masked language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
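+
+        Note that in typical masked-LM fine-tuning only a fraction of the input positions are replaced by the
+        mask token and ``masked_lm_labels`` is set to ``-1`` everywhere else, so the loss is computed on the
+        masked positions only; the example below simply reuses ``input_ids`` as labels for brevity.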
+ + Examples:: + + tokenizer = RobertaTokenizer.from_pretrained('roberta-base') + model = RobertaForMaskedLM.from_pretrained('roberta-base') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, masked_lm_labels=input_ids) + loss, prediction_scores = outputs[:2] + + """ + config_class = RobertaConfig + pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP + base_model_prefix = "roberta" + + def __init__(self, config): + super(RobertaForMaskedLM, self).__init__(config) + + self.roberta = RobertaModel(config) + self.lm_head = RobertaLMHead(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the input and output embeddings. + Export to TorchScript can't handle parameter sharing so we are cloning them instead. + """ + self._tie_or_clone_weights(self.lm_head.decoder, self.roberta.embeddings.word_embeddings) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + masked_lm_labels=None): + + outputs = self.roberta(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + + + sequence_output = outputs[0] + prediction_scores = self.lm_head(sequence_output) + + outputs = (prediction_scores,) + outputs[2:] # Add hidden states and attention if they are here + + + if masked_lm_labels is not None: + loss_fct = CrossEntropyLoss(ignore_index=-1) + masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) + outputs = (masked_lm_loss,) + outputs + + return outputs # (masked_lm_loss), prediction_scores, (hidden_states), (attentions) + + +class RobertaLMHead(nn.Module): + """Roberta Head for masked language modeling.""" + + def __init__(self, config): + super(RobertaLMHead, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + self.bias = nn.Parameter(torch.zeros(config.vocab_size)) + + def forward(self, features, **kwargs): + x = self.dense(features) + x = gelu(x) + x = self.layer_norm(x) + + # project back to size of vocabulary with bias + x = self.decoder(x) + self.bias + + return x + + +@add_start_docstrings("""RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer + on top of the pooled output) e.g. for GLUE tasks. """, + ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING) +class RobertaForSequenceClassification(BertPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels]``. + If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = RobertaTokenizer.from_pretrained('roberta-base') + model = RobertaForSequenceClassification.from_pretrained('roberta-base') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + config_class = RobertaConfig + pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP + base_model_prefix = "roberta" + + def __init__(self, config): + super(RobertaForSequenceClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.roberta = RobertaModel(config) + self.classifier = RobertaClassificationHead(config) + + def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, + labels=None): + outputs = self.roberta(input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask) + sequence_output = outputs[0] + logits = self.classifier(sequence_output) + + outputs = (logits,) + outputs[2:] + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # (loss), logits, (hidden_states), (attentions) + +@add_start_docstrings("""Roberta Model with a multiple choice classification head on top (a linear layer on top of + the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """, + ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING) +class RobertaForMultipleChoice(BertPreTrainedModel): + r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + The second dimension of the input (`num_choices`) indicates the number of choices to score. + To match pre-training, RoBerta input sequence should be formatted with [CLS] and [SEP] tokens as follows: + + (a) For sequence pairs: + + ``tokens: [CLS] is this jack ##son ##ville ? [SEP] [SEP] no it is not . [SEP]`` + + ``token_type_ids: 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1`` + + (b) For single sequences: + + ``tokens: [CLS] the dog is hairy . [SEP]`` + + ``token_type_ids: 0 0 0 0 0 0 0`` + + Indices can be obtained using :class:`pytorch_transformers.BertTokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Segment token indices to indicate first and second portions of the inputs. 
+ The second dimension of the input (`num_choices`) indicates the number of choices to score. + Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Mask to avoid performing attention on padding token indices. + The second dimension of the input (`num_choices`) indicates the number of choices to score. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the multiple choice classification loss. + Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above) + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss. + **classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above). + Classification scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
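+
+        Internally, inputs of shape ``(batch_size, num_choices, sequence_length)`` are flattened to
+        ``(batch_size * num_choices, sequence_length)`` before being passed through RoBERTa, and the
+        per-choice logits are reshaped back to ``(batch_size, num_choices)`` (scores before SoftMax).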
+ + Examples:: + + tokenizer = RobertaTokenizer.from_pretrained('roberta-base') + model = RobertaForMultipleChoice.from_pretrained('roberta-base') + choices = ["Hello, my dog is cute", "Hello, my cat is amazing"] + input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices + labels = torch.tensor(1).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, classification_scores = outputs[:2] + + """ + config_class = RobertaConfig + pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP + base_model_prefix = "roberta" + + def __init__(self, config): + super(RobertaForMultipleChoice, self).__init__(config) + + self.roberta = RobertaModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 1) + + self.init_weights() + + def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, + position_ids=None, head_mask=None): + num_choices = input_ids.shape[1] + + flat_input_ids = input_ids.view(-1, input_ids.size(-1)) + flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None + flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None + flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None + outputs = self.roberta(flat_input_ids, position_ids=flat_position_ids, token_type_ids=flat_token_type_ids, + attention_mask=flat_attention_mask, head_mask=head_mask) + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + reshaped_logits = logits.view(-1, num_choices) + + outputs = (reshaped_logits,) + outputs[2:] # add hidden states and attention if they are here + + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels) + outputs = (loss,) + outputs + + return outputs # (loss), reshaped_logits, (hidden_states), (attentions) + + + +class RobertaClassificationHead(nn.Module): + """Head for sentence-level classification tasks.""" + + def __init__(self, config): + super(RobertaClassificationHead, self).__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.out_proj = nn.Linear(config.hidden_size, config.num_labels) + + def forward(self, features, **kwargs): + x = features[:, 0, :] # take token (equiv. to [CLS]) + x = self.dropout(x) + x = self.dense(x) + x = torch.tanh(x) + x = self.dropout(x) + x = self.out_proj(x) + return x diff --git a/Optimus/code/pytorch_transformers/modeling_transfo_xl.py b/Optimus/code/pytorch_transformers/modeling_transfo_xl.py new file mode 100755 index 0000000000000000000000000000000000000000..73b04eee605f1bfc9bb02bea644a732c426c1d5f --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_transfo_xl.py @@ -0,0 +1,1240 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch Transformer XL model. + Adapted from https://github.com/kimiyoung/transformer-xl. + In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py +""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import json +import math +import logging +import collections +import sys +from io import open + +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.nn import CrossEntropyLoss +from torch.nn.parameter import Parameter + +from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary +from .configuration_transfo_xl import TransfoXLConfig +from .modeling_transfo_xl_utilities import ProjectedAdaptiveLogSoftmax, sample_logits +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-pytorch_model.bin", +} + +def build_tf_to_pytorch_map(model, config): + """ A map of modules from TF to PyTorch. + This time I use a map to keep the PyTorch model as identical to the original PyTorch model as possible. + """ + tf_to_pt_map = {} + + if hasattr(model, 'transformer'): + # We are loading in a TransfoXLLMHeadModel => we will load also the Adaptive Softmax + tf_to_pt_map.update({ + "transformer/adaptive_softmax/cutoff_0/cluster_W": model.crit.cluster_weight, + "transformer/adaptive_softmax/cutoff_0/cluster_b": model.crit.cluster_bias}) + for i, (out_l, proj_l, tie_proj) in enumerate(zip( + model.crit.out_layers, + model.crit.out_projs, + config.tie_projs)): + layer_str = "transformer/adaptive_softmax/cutoff_%d/" % i + if config.tie_weight: + tf_to_pt_map.update({ + layer_str + 'b': out_l.bias}) + else: + raise NotImplementedError + # I don't think this is implemented in the TF code + tf_to_pt_map.update({ + layer_str + 'lookup_table': out_l.weight, + layer_str + 'b': out_l.bias}) + if not tie_proj: + tf_to_pt_map.update({ + layer_str + 'proj': proj_l + }) + # Now load the rest of the transformer + model = model.transformer + + # Embeddings + for i, (embed_l, proj_l) in enumerate(zip(model.word_emb.emb_layers, model.word_emb.emb_projs)): + layer_str = "transformer/adaptive_embed/cutoff_%d/" % i + tf_to_pt_map.update({ + layer_str + 'lookup_table': embed_l.weight, + layer_str + 'proj_W': proj_l + }) + + # Transformer blocks + for i, b in enumerate(model.layers): + layer_str = "transformer/layer_%d/" % i + tf_to_pt_map.update({ + layer_str + "rel_attn/LayerNorm/gamma": b.dec_attn.layer_norm.weight, + layer_str + "rel_attn/LayerNorm/beta": b.dec_attn.layer_norm.bias, + layer_str + "rel_attn/o/kernel": b.dec_attn.o_net.weight, + layer_str + "rel_attn/qkv/kernel": b.dec_attn.qkv_net.weight, + layer_str + "rel_attn/r/kernel": b.dec_attn.r_net.weight, + layer_str + "ff/LayerNorm/gamma": b.pos_ff.layer_norm.weight, + layer_str + "ff/LayerNorm/beta": b.pos_ff.layer_norm.bias, + layer_str + "ff/layer_1/kernel": b.pos_ff.CoreNet[0].weight, + layer_str + "ff/layer_1/bias": b.pos_ff.CoreNet[0].bias, 
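+            # CoreNet[0] and CoreNet[3] are the two Linear layers of PositionwiseFF
+            # (indices 1, 2 and 4 are the ReLU and Dropout modules of the Sequential).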
+ layer_str + "ff/layer_2/kernel": b.pos_ff.CoreNet[3].weight, + layer_str + "ff/layer_2/bias": b.pos_ff.CoreNet[3].bias, + }) + + # Relative positioning biases + if config.untie_r: + r_r_list = [] + r_w_list = [] + for b in model.layers: + r_r_list.append(b.dec_attn.r_r_bias) + r_w_list.append(b.dec_attn.r_w_bias) + else: + r_r_list = [model.r_r_bias] + r_w_list = [model.r_w_bias] + tf_to_pt_map.update({ + 'transformer/r_r_bias': r_r_list, + 'transformer/r_w_bias': r_w_list}) + return tf_to_pt_map + +def load_tf_weights_in_transfo_xl(model, config, tf_path): + """ Load tf checkpoints in a pytorch model + """ + try: + import numpy as np + import tensorflow as tf + except ImportError: + logger.error("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + # Build TF to PyTorch weights loading map + tf_to_pt_map = build_tf_to_pytorch_map(model, config) + + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + tf_weights = {} + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + tf_weights[name] = array + + for name, pointer in tf_to_pt_map.items(): + assert name in tf_weights + array = tf_weights[name] + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if 'kernel' in name or 'proj' in name: + array = np.transpose(array) + if ('r_r_bias' in name or 'r_w_bias' in name) and len(pointer) > 1: + # Here we will split the TF weigths + assert len(pointer) == array.shape[0] + for i, p_i in enumerate(pointer): + arr_i = array[i, ...] 
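+                # arr_i is the slice of the stacked TF variable belonging to layer i
+                # (r_r_bias / r_w_bias are stored stacked across layers in the checkpoint).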
+ try: + assert p_i.shape == arr_i.shape + except AssertionError as e: + e.args += (p_i.shape, arr_i.shape) + raise + logger.info("Initialize PyTorch weight {} for layer {}".format(name, i)) + p_i.data = torch.from_numpy(arr_i) + else: + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + tf_weights.pop(name, None) + tf_weights.pop(name + '/Adam', None) + tf_weights.pop(name + '/Adam_1', None) + + logger.info("Weights not copied to PyTorch model: {}".format(', '.join(tf_weights.keys()))) + return model + + +class PositionalEmbedding(nn.Module): + def __init__(self, demb): + super(PositionalEmbedding, self).__init__() + + self.demb = demb + + inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb)) + self.register_buffer('inv_freq', inv_freq) + + def forward(self, pos_seq, bsz=None): + sinusoid_inp = torch.ger(pos_seq, self.inv_freq) + pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1) + + if bsz is not None: + return pos_emb[:,None,:].expand(-1, bsz, -1) + else: + return pos_emb[:,None,:] + + + +class PositionwiseFF(nn.Module): + def __init__(self, d_model, d_inner, dropout, pre_lnorm=False): + super(PositionwiseFF, self).__init__() + + self.d_model = d_model + self.d_inner = d_inner + self.dropout = dropout + + self.CoreNet = nn.Sequential( + nn.Linear(d_model, d_inner), nn.ReLU(inplace=True), + nn.Dropout(dropout), + nn.Linear(d_inner, d_model), + nn.Dropout(dropout), + ) + + self.layer_norm = nn.LayerNorm(d_model) + + self.pre_lnorm = pre_lnorm + + def forward(self, inp): + if self.pre_lnorm: + ##### layer normalization + positionwise feed-forward + core_out = self.CoreNet(self.layer_norm(inp)) + + ##### residual connection + output = core_out + inp + else: + ##### positionwise feed-forward + core_out = self.CoreNet(inp) + + ##### residual connection + layer normalization + output = self.layer_norm(inp + core_out) + + return output + + + +class MultiHeadAttn(nn.Module): + def __init__(self, n_head, d_model, d_head, dropout, dropatt=0, + pre_lnorm=False, r_r_bias=None, r_w_bias=None, output_attentions=False): + super(MultiHeadAttn, self).__init__() + + self.output_attentions = output_attentions + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + self.dropout = dropout + + self.q_net = nn.Linear(d_model, n_head * d_head, bias=False) + self.kv_net = nn.Linear(d_model, 2 * n_head * d_head, bias=False) + + self.drop = nn.Dropout(dropout) + self.dropatt = nn.Dropout(dropatt) + self.o_net = nn.Linear(n_head * d_head, d_model, bias=False) + + self.layer_norm = nn.LayerNorm(d_model) + + self.scale = 1 / (d_head ** 0.5) + + self.pre_lnorm = pre_lnorm + + if r_r_bias is None or r_w_bias is None: # Biases are not shared + self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + else: + self.r_r_bias = r_r_bias + self.r_w_bias = r_w_bias + + def forward(self, h, attn_mask=None, mems=None, head_mask=None): + ##### multihead attention + # [hlen x bsz x n_head x d_head] + + if mems is not None: + c = torch.cat([mems, h], 0) + else: + c = h + + if self.pre_lnorm: + ##### layer normalization + c = self.layer_norm(c) + + head_q = self.q_net(h) + head_k, head_v = torch.chunk(self.kv_net(c), 2, -1) + + head_q = head_q.view(h.size(0), h.size(1), self.n_head, self.d_head) + head_k = 
head_k.view(c.size(0), c.size(1), self.n_head, self.d_head) + head_v = head_v.view(c.size(0), c.size(1), self.n_head, self.d_head) + + # [qlen x klen x bsz x n_head] + attn_score = torch.einsum('ibnd,jbnd->ijbn', (head_q, head_k)) + attn_score.mul_(self.scale) + if attn_mask is not None and torch.sum(attn_mask).item(): + attn_mask = (attn_mask == 1) # Switch to bool + if attn_mask.dim() == 2: + attn_score.masked_fill_(attn_mask[None,:,:,None], -float('inf')) + elif attn_mask.dim() == 3: + attn_score.masked_fill_(attn_mask[:,:,:,None], -float('inf')) + + # [qlen x klen x bsz x n_head] + attn_prob = F.softmax(attn_score, dim=1) + attn_prob = self.dropatt(attn_prob) + + # Mask heads if we want to + if head_mask is not None: + attn_prob = attn_prob * head_mask + + # [qlen x klen x bsz x n_head] + [klen x bsz x n_head x d_head] -> [qlen x bsz x n_head x d_head] + attn_vec = torch.einsum('ijbn,jbnd->ibnd', (attn_prob, head_v)) + attn_vec = attn_vec.contiguous().view( + attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head) + + ##### linear projection + attn_out = self.o_net(attn_vec) + attn_out = self.drop(attn_out) + + if self.pre_lnorm: + ##### residual connection + outputs = [h + attn_out] + else: + ##### residual connection + layer normalization + outputs = [self.layer_norm(h + attn_out)] + + if self.output_attentions: + outputs.append(attn_prob) + + return outputs + +class RelMultiHeadAttn(nn.Module): + def __init__(self, n_head, d_model, d_head, dropout, dropatt=0, + tgt_len=None, ext_len=None, mem_len=None, pre_lnorm=False, + r_r_bias=None, r_w_bias=None, output_attentions=False): + super(RelMultiHeadAttn, self).__init__() + + self.output_attentions = output_attentions + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + self.dropout = dropout + + self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False) + + self.drop = nn.Dropout(dropout) + self.dropatt = nn.Dropout(dropatt) + self.o_net = nn.Linear(n_head * d_head, d_model, bias=False) + + self.layer_norm = nn.LayerNorm(d_model) + + self.scale = 1 / (d_head ** 0.5) + + self.pre_lnorm = pre_lnorm + + if r_r_bias is None or r_w_bias is None: # Biases are not shared + self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + else: + self.r_r_bias = r_r_bias + self.r_w_bias = r_w_bias + + def _parallelogram_mask(self, h, w, left=False): + mask = torch.ones((h, w)).byte() + m = min(h, w) + mask[:m,:m] = torch.triu(mask[:m,:m]) + mask[-m:,-m:] = torch.tril(mask[-m:,-m:]) + + if left: + return mask + else: + return mask.flip(0) + + def _shift(self, x, qlen, klen, mask, left=False): + if qlen > 1: + zero_pad = torch.zeros((x.size(0), qlen-1, x.size(2), x.size(3)), + device=x.device, dtype=x.dtype) + else: + zero_pad = torch.zeros(0, device=x.device, dtype=x.dtype) + + if left: + mask = mask.flip(1) + x_padded = torch.cat([zero_pad, x], dim=1).expand(qlen, -1, -1, -1) + else: + x_padded = torch.cat([x, zero_pad], dim=1).expand(qlen, -1, -1, -1) + + x = x_padded.masked_select(mask[:,:,None,None]) \ + .view(qlen, klen, x.size(2), x.size(3)) + + return x + + def _rel_shift(self, x, zero_triu=False): + zero_pad_shape = (x.size(0), 1) + x.size()[2:] + zero_pad = torch.zeros(zero_pad_shape, device=x.device, dtype=x.dtype) + x_padded = torch.cat([zero_pad, x], dim=1) + + x_padded_shape = (x.size(1) + 1, x.size(0)) + x.size()[2:] + x_padded = x_padded.view(*x_padded_shape) + + x = x_padded[1:].view_as(x) + + if zero_triu: + 
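+            # optionally zero out entries above the (klen - qlen) diagonal after the shift,
+            # i.e. scores at relative offsets that should not contribute to attention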
ones = torch.ones((x.size(0), x.size(1))) + x = x * torch.tril(ones, x.size(1) - x.size(0))[:,:,None,None] + + return x + + def forward(self, w, r, attn_mask=None, mems=None): + raise NotImplementedError + +class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + super(RelPartialLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False) + + def forward(self, w, r, attn_mask=None, mems=None, head_mask=None): + qlen, rlen, bsz = w.size(0), r.size(0), w.size(1) + + if mems is not None: + cat = torch.cat([mems, w], 0) + if self.pre_lnorm: + w_heads = self.qkv_net(self.layer_norm(cat)) + else: + w_heads = self.qkv_net(cat) + r_head_k = self.r_net(r) + + w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1) + w_head_q = w_head_q[-qlen:] + else: + if self.pre_lnorm: + w_heads = self.qkv_net(self.layer_norm(w)) + else: + w_heads = self.qkv_net(w) + r_head_k = self.r_net(r) + + w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1) + + klen = w_head_k.size(0) + + w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head) # qlen x bsz x n_head x d_head + w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head) # qlen x bsz x n_head x d_head + w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head) # qlen x bsz x n_head x d_head + + r_head_k = r_head_k.view(rlen, self.n_head, self.d_head) # qlen x n_head x d_head + + #### compute attention score + rw_head_q = w_head_q + self.r_w_bias # qlen x bsz x n_head x d_head + AC = torch.einsum('ibnd,jbnd->ijbn', (rw_head_q, w_head_k)) # qlen x klen x bsz x n_head + + rr_head_q = w_head_q + self.r_r_bias + BD = torch.einsum('ibnd,jnd->ijbn', (rr_head_q, r_head_k)) # qlen x klen x bsz x n_head + BD = self._rel_shift(BD) + + # [qlen x klen x bsz x n_head] + attn_score = AC + BD + attn_score.mul_(self.scale) + + #### compute attention probability + if attn_mask is not None and torch.sum(attn_mask).item(): + attn_mask = (attn_mask == 1) # Switch to bool + if attn_mask.dim() == 2: + if next(self.parameters()).dtype == torch.float16: + attn_score = attn_score.float().masked_fill( + attn_mask[None,:,:,None], -65000).type_as(attn_score) + else: + attn_score = attn_score.float().masked_fill( + attn_mask[None,:,:,None], -1e30).type_as(attn_score) + elif attn_mask.dim() == 3: + if next(self.parameters()).dtype == torch.float16: + attn_score = attn_score.float().masked_fill( + attn_mask[:,:,:,None], -65000).type_as(attn_score) + else: + attn_score = attn_score.float().masked_fill( + attn_mask[:,:,:,None], -1e30).type_as(attn_score) + + # [qlen x klen x bsz x n_head] + attn_prob = F.softmax(attn_score, dim=1) + attn_prob = self.dropatt(attn_prob) + + # Mask heads if we want to + if head_mask is not None: + attn_prob = attn_prob * head_mask + + #### compute attention vector + attn_vec = torch.einsum('ijbn,jbnd->ibnd', (attn_prob, w_head_v)) + + # [qlen x bsz x n_head x d_head] + attn_vec = attn_vec.contiguous().view( + attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head) + + ##### linear projection + attn_out = self.o_net(attn_vec) + attn_out = self.drop(attn_out) + + if self.pre_lnorm: + ##### residual connection + outputs = [w + attn_out] + else: + ##### residual connection + layer normalization + outputs = [self.layer_norm(w + attn_out)] + + if self.output_attentions: + outputs.append(attn_prob) + + return outputs + +class RelLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + 
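+        # Variant used for config.attn_type == 1: the relative position embeddings (r_emb)
+        # and biases (r_bias) are learned per layer and passed in at forward time.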
super(RelLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + def forward(self, w, r_emb, r_w_bias, r_bias, attn_mask=None, mems=None, head_mask=None): + # r_emb: [klen, n_head, d_head], used for term B + # r_w_bias: [n_head, d_head], used for term C + # r_bias: [klen, n_head], used for term D + + qlen, bsz = w.size(0), w.size(1) + + if mems is not None: + cat = torch.cat([mems, w], 0) + if self.pre_lnorm: + w_heads = self.qkv_net(self.layer_norm(cat)) + else: + w_heads = self.qkv_net(cat) + w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1) + + w_head_q = w_head_q[-qlen:] + else: + if self.pre_lnorm: + w_heads = self.qkv_net(self.layer_norm(w)) + else: + w_heads = self.qkv_net(w) + w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1) + + klen = w_head_k.size(0) + + w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head) + w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head) + w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head) + + if klen > r_emb.size(0): + r_emb_pad = r_emb[0:1].expand(klen-r_emb.size(0), -1, -1) + r_emb = torch.cat([r_emb_pad, r_emb], 0) + r_bias_pad = r_bias[0:1].expand(klen-r_bias.size(0), -1) + r_bias = torch.cat([r_bias_pad, r_bias], 0) + else: + r_emb = r_emb[-klen:] + r_bias = r_bias[-klen:] + + #### compute attention score + rw_head_q = w_head_q + r_w_bias[None] # qlen x bsz x n_head x d_head + + AC = torch.einsum('ibnd,jbnd->ijbn', (rw_head_q, w_head_k)) # qlen x klen x bsz x n_head + B_ = torch.einsum('ibnd,jnd->ijbn', (w_head_q, r_emb)) # qlen x klen x bsz x n_head + D_ = r_bias[None, :, None] # 1 x klen x 1 x n_head + BD = self._rel_shift(B_ + D_) + + # [qlen x klen x bsz x n_head] + attn_score = AC + BD + attn_score.mul_(self.scale) + + #### compute attention probability + if attn_mask is not None and torch.sum(attn_mask).item(): + attn_mask = (attn_mask == 1) # Switch to bool + if attn_mask.dim() == 2: + attn_score.masked_fill_(attn_mask[None,:,:,None], -float('inf')) + elif attn_mask.dim() == 3: + attn_score.masked_fill_(attn_mask[:,:,:,None], -float('inf')) + + # [qlen x klen x bsz x n_head] + attn_prob = F.softmax(attn_score, dim=1) + attn_prob = self.dropatt(attn_prob) + + if head_mask is not None: + attn_prob = attn_prob * head_mask + + #### compute attention vector + attn_vec = torch.einsum('ijbn,jbnd->ibnd', (attn_prob, w_head_v)) + + # [qlen x bsz x n_head x d_head] + attn_vec = attn_vec.contiguous().view( + attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head) + + ##### linear projection + attn_out = self.o_net(attn_vec) + attn_out = self.drop(attn_out) + + if self.pre_lnorm: + ##### residual connection + outputs = [w + attn_out] + else: + ##### residual connection + layer normalization + outputs = [self.layer_norm(w + attn_out)] + + if self.output_attentions: + outputs.append(attn_prob) + + return outputs + + + +class DecoderLayer(nn.Module): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(DecoderLayer, self).__init__() + + self.dec_attn = MultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFF(d_model, d_inner, dropout, + pre_lnorm=kwargs.get('pre_lnorm')) + + def forward(self, dec_inp, dec_attn_mask=None, mems=None, head_mask=None): + + attn_outputs = self.dec_attn(dec_inp, attn_mask=dec_attn_mask, + mems=mems, head_mask=head_mask) + ff_output = self.pos_ff(attn_outputs[0]) + + outputs = [ff_output] + attn_outputs[1:] + + return outputs + +class RelLearnableDecoderLayer(nn.Module): + def __init__(self, n_head, 
d_model, d_head, d_inner, dropout, + **kwargs): + super(RelLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, + **kwargs) + self.pos_ff = PositionwiseFF(d_model, d_inner, dropout, + pre_lnorm=kwargs.get('pre_lnorm')) + + def forward(self, dec_inp, r_emb, r_w_bias, r_bias, dec_attn_mask=None, mems=None, head_mask=None): + + attn_outputs = self.dec_attn(dec_inp, r_emb, r_w_bias, r_bias, + attn_mask=dec_attn_mask, + mems=mems, head_mask=head_mask) + ff_output = self.pos_ff(attn_outputs[0]) + + outputs = [ff_output] + attn_outputs[1:] + + return outputs + +class RelPartialLearnableDecoderLayer(nn.Module): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, + **kwargs): + super(RelPartialLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelPartialLearnableMultiHeadAttn(n_head, d_model, + d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFF(d_model, d_inner, dropout, + pre_lnorm=kwargs.get('pre_lnorm')) + + def forward(self, dec_inp, r, dec_attn_mask=None, mems=None, head_mask=None): + + attn_outputs = self.dec_attn(dec_inp, r, + attn_mask=dec_attn_mask, + mems=mems, head_mask=head_mask) + ff_output = self.pos_ff(attn_outputs[0]) + + outputs = [ff_output] + attn_outputs[1:] + + return outputs + + + +class AdaptiveEmbedding(nn.Module): + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, + sample_softmax=False): + super(AdaptiveEmbedding, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + + self.cutoffs = cutoffs + [n_token] + self.div_val = div_val + self.d_proj = d_proj + + self.emb_scale = d_proj ** 0.5 + + self.cutoff_ends = [0] + self.cutoffs + + self.emb_layers = nn.ModuleList() + self.emb_projs = nn.ParameterList() + if div_val == 1: + self.emb_layers.append( + nn.Embedding(n_token, d_embed, sparse=sample_softmax>0) + ) + if d_proj != d_embed: + self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed))) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i+1] + d_emb_i = d_embed // (div_val ** i) + self.emb_layers.append(nn.Embedding(r_idx-l_idx, d_emb_i)) + self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i))) + + def forward(self, inp): + if self.div_val == 1: + embed = self.emb_layers[0](inp) + if self.d_proj != self.d_embed: + embed = F.linear(embed, self.emb_projs[0]) + else: + param = next(self.parameters()) + inp_flat = inp.view(-1) + emb_flat = torch.zeros([inp_flat.size(0), self.d_proj], + dtype=param.dtype, device=param.device) + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + + mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx) + indices_i = mask_i.nonzero().squeeze() + + if indices_i.numel() == 0: + continue + + inp_i = inp_flat.index_select(0, indices_i) - l_idx + emb_i = self.emb_layers[i](inp_i) + emb_i = F.linear(emb_i, self.emb_projs[i]) + + emb_flat.index_copy_(0, indices_i, emb_i) + + embed_shape = inp.size() + (self.d_proj,) + embed = emb_flat.view(embed_shape) + + embed.mul_(self.emb_scale) + + return embed + + +class TransfoXLPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. 
+ """ + config_class = TransfoXLConfig + pretrained_model_archive_map = TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = load_tf_weights_in_transfo_xl + base_model_prefix = "transformer" + + def _init_weight(self, weight): + if self.config.init == 'uniform': + nn.init.uniform_(weight, -self.config.init_range, self.config.init_range) + elif self.config.init == 'normal': + nn.init.normal_(weight, 0.0, self.config.init_std) + + def _init_bias(self, bias): + nn.init.constant_(bias, 0.0) + + def _init_weights(self, m): + """ Initialize the weights. + """ + classname = m.__class__.__name__ + if classname.find('Linear') != -1: + if hasattr(m, 'weight') and m.weight is not None: + self._init_weight(m.weight) + if hasattr(m, 'bias') and m.bias is not None: + self._init_bias(m.bias) + elif classname.find('AdaptiveEmbedding') != -1: + if hasattr(m, 'emb_projs'): + for i in range(len(m.emb_projs)): + if m.emb_projs[i] is not None: + nn.init.normal_(m.emb_projs[i], 0.0, self.config.proj_init_std) + elif classname.find('Embedding') != -1: + if hasattr(m, 'weight'): + self._init_weight(m.weight) + elif classname.find('ProjectedAdaptiveLogSoftmax') != -1: + if hasattr(m, 'cluster_weight') and m.cluster_weight is not None: + self._init_weight(m.cluster_weight) + if hasattr(m, 'cluster_bias') and m.cluster_bias is not None: + self._init_bias(m.cluster_bias) + if hasattr(m, 'out_projs'): + for i in range(len(m.out_projs)): + if m.out_projs[i] is not None: + nn.init.normal_(m.out_projs[i], 0.0, self.config.proj_init_std) + elif classname.find('LayerNorm') != -1: + if hasattr(m, 'weight'): + nn.init.normal_(m.weight, 1.0, self.config.init_std) + if hasattr(m, 'bias') and m.bias is not None: + self._init_bias(m.bias) + else: + if hasattr(m, 'r_emb'): + self._init_weight(m.r_emb) + if hasattr(m, 'r_w_bias'): + self._init_weight(m.r_w_bias) + if hasattr(m, 'r_r_bias'): + self._init_weight(m.r_r_bias) + if hasattr(m, 'r_bias'): + self._init_bias(m.r_bias) + + def set_num_special_tokens(self, num_special_tokens): + pass + + +TRANSFO_XL_START_DOCSTRING = r""" The Transformer-XL model was proposed in + `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context`_ + by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. + It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can reuse + previously computed hidden-states to attend to longer context (memory). + This model also uses adaptive softmax inputs and outputs (tied). + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context`: + https://arxiv.org/abs/1901.02860 + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.TransfoXLConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +TRANSFO_XL_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. 
+ Transformer-XL is a model with relative position embeddings so you can either pad the inputs on + the right or on the left. + Indices can be obtained using :class:`pytorch_transformers.TransfoXLTokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **mems**: (`optional`) + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + (see `mems` output below). Can be used to speed up sequential decoding and attend to longer context. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare Bert Model transformer outputting raw hidden-states without any specific head on top.", + TRANSFO_XL_START_DOCSTRING, TRANSFO_XL_INPUTS_DOCSTRING) +class TransfoXLModel(TransfoXLPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the last layer of the model. + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
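The `head_mask` input described above is easiest to see as a concrete tensor: one row per layer, one column per head, with 1.0 keeping a head and 0.0 silencing it. A small sketch (the checkpoint name matches the usage example in this docstring; masking head 0 everywhere is arbitrary):

```
import torch
from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)

head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[:, 0] = 0.0                       # mask head 0 in every layer
outputs = model(input_ids, head_mask=head_mask)
```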
+ + Examples:: + + tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103') + model = TransfoXLModel.from_pretrained('transfo-xl-wt103') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states, mems = outputs[:2] + + """ + def __init__(self, config): + super(TransfoXLModel, self).__init__(config) + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + self.n_token = config.n_token + + self.d_embed = config.d_embed + self.d_model = config.d_model + self.n_head = config.n_head + self.d_head = config.d_head + + self.word_emb = AdaptiveEmbedding(config.n_token, config.d_embed, config.d_model, config.cutoffs, + div_val=config.div_val) + + self.drop = nn.Dropout(config.dropout) + + self.n_layer = config.n_layer + + self.tgt_len = config.tgt_len + self.mem_len = config.mem_len + self.ext_len = config.ext_len + self.max_klen = config.tgt_len + config.ext_len + config.mem_len + + self.attn_type = config.attn_type + + if not config.untie_r: + self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + + self.layers = nn.ModuleList() + if config.attn_type == 0: # the default attention + for i in range(config.n_layer): + self.layers.append( + RelPartialLearnableDecoderLayer( + config.n_head, config.d_model, config.d_head, config.d_inner, config.dropout, + tgt_len=config.tgt_len, ext_len=config.ext_len, mem_len=config.mem_len, + dropatt=config.dropatt, pre_lnorm=config.pre_lnorm, + r_w_bias=None if config.untie_r else self.r_w_bias, + r_r_bias=None if config.untie_r else self.r_r_bias, + output_attentions=self.output_attentions) + ) + elif config.attn_type == 1: # learnable embeddings + for i in range(config.n_layer): + self.layers.append( + RelLearnableDecoderLayer( + config.n_head, config.d_model, config.d_head, config.d_inner, config.dropout, + tgt_len=config.tgt_len, ext_len=config.ext_len, mem_len=config.mem_len, + dropatt=config.dropatt, pre_lnorm=config.pre_lnorm, + r_w_bias=None if config.untie_r else self.r_w_bias, + r_r_bias=None if config.untie_r else self.r_r_bias, + output_attentions=self.output_attentions) + ) + elif config.attn_type in [2, 3]: # absolute embeddings + for i in range(config.n_layer): + self.layers.append( + DecoderLayer( + config.n_head, config.d_model, config.d_head, config.d_inner, config.dropout, + dropatt=config.dropatt, pre_lnorm=config.pre_lnorm, + r_w_bias=None if config.untie_r else self.r_w_bias, + r_r_bias=None if config.untie_r else self.r_r_bias, + output_attentions=self.output_attentions) + ) + + self.same_length = config.same_length + self.clamp_len = config.clamp_len + + if self.attn_type == 0: # default attention + self.pos_emb = PositionalEmbedding(self.d_model) + elif self.attn_type == 1: # learnable + self.r_emb = nn.Parameter(torch.FloatTensor( + self.n_layer, self.max_klen, self.n_head, self.d_head)) + self.r_bias = nn.Parameter(torch.FloatTensor( + self.n_layer, self.max_klen, self.n_head)) + elif self.attn_type == 2: # absolute standard + self.pos_emb = PositionalEmbedding(self.d_model) + elif self.attn_type == 3: # absolute deeper SA + self.r_emb = nn.Parameter(torch.FloatTensor( + self.n_layer, self.max_klen, self.n_head, self.d_head)) + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + return self.word_emb + + def backward_compatible(self): + self.sample_softmax = -1 + + 
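The `mems` return value is what gives Transformer-XL its long context: feeding it back into the next call lets each segment attend to the cached hidden states of earlier segments. A rough usage sketch (segment length and the repeated text are arbitrary choices for illustration):

```
import torch
from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()

tokens = tokenizer.encode("a long document " * 100)
segment_len, mems = 64, None
with torch.no_grad():
    for start in range(0, len(tokens), segment_len):
        segment = torch.tensor([tokens[start:start + segment_len]])  # (1, seg_len)
        outputs = model(segment, mems=mems)
        last_hidden, mems = outputs[:2]   # carry the memory into the next segment
```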
def reset_length(self, tgt_len, ext_len, mem_len): + self.tgt_len = tgt_len + self.mem_len = mem_len + self.ext_len = ext_len + + def _prune_heads(self, heads): + logger.info("Head pruning is not implemented for Transformer-XL model") + pass + + def init_mems(self, data): + if self.mem_len > 0: + mems = [] + param = next(self.parameters()) + for i in range(self.n_layer): + empty = torch.zeros(self.mem_len, data.size(1), self.config.d_model, + dtype=param.dtype, device=param.device) + mems.append(empty) + + return mems + else: + return None + + def _update_mems(self, hids, mems, qlen, mlen): + # does not deal with None + if mems is None: return None + + # mems is not None + assert len(hids) == len(mems), 'len(hids) != len(mems)' + + # There are `mlen + qlen` steps that can be cached into mems + # For the next step, the last `ext_len` of the `qlen` tokens + # will be used as the extended context. Hence, we only cache + # the tokens from `mlen + qlen - self.ext_len - self.mem_len` + # to `mlen + qlen - self.ext_len`. + with torch.no_grad(): + new_mems = [] + end_idx = mlen + max(0, qlen - 0 - self.ext_len) + beg_idx = max(0, end_idx - self.mem_len) + for i in range(len(hids)): + + cat = torch.cat([mems[i], hids[i]], dim=0) + new_mems.append(cat[beg_idx:end_idx].detach()) + + return new_mems + + def _forward(self, dec_inp, mems=None, head_mask=None): + qlen, bsz = dec_inp.size() + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer) + # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0) + head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1) + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.n_layer + + word_emb = self.word_emb(dec_inp) + + mlen = mems[0].size(0) if mems is not None else 0 + klen = mlen + qlen + if self.same_length: + all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8) + mask_len = klen - self.mem_len + if mask_len > 0: + mask_shift_len = qlen - mask_len + else: + mask_shift_len = qlen + dec_attn_mask = (torch.triu(all_ones, 1+mlen) + + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1 + else: + dec_attn_mask = torch.triu( + word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None] + + hids = [] + attentions = [] + if self.attn_type == 0: # default + pos_seq = torch.arange(klen-1, -1, -1.0, device=word_emb.device, + dtype=word_emb.dtype) + if self.clamp_len > 0: + pos_seq.clamp_(max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq) + + core_out = self.drop(word_emb) + pos_emb = self.drop(pos_emb) + + for i, layer in enumerate(self.layers): + hids.append(core_out) + mems_i = None if mems is None else mems[i] + layer_outputs = layer(core_out, pos_emb, dec_attn_mask=dec_attn_mask, + mems=mems_i, head_mask=head_mask[i]) + core_out = layer_outputs[0] + if self.output_attentions: + attentions.append(layer_outputs[1]) + elif self.attn_type == 1: # learnable + core_out = self.drop(word_emb) + for i, layer in enumerate(self.layers): + hids.append(core_out) + if self.clamp_len > 0: + r_emb = 
self.r_emb[i][-self.clamp_len :] + r_bias = self.r_bias[i][-self.clamp_len :] + else: + r_emb, r_bias = self.r_emb[i], self.r_bias[i] + + mems_i = None if mems is None else mems[i] + layer_outputs = layer(core_out, r_emb, self.r_w_bias[i], + r_bias, dec_attn_mask=dec_attn_mask, + mems=mems_i, head_mask=head_mask[i]) + core_out = layer_outputs[0] + if self.output_attentions: + attentions.append(layer_outputs[1]) + elif self.attn_type == 2: # absolute + pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, + dtype=word_emb.dtype) + if self.clamp_len > 0: + pos_seq.clamp_(max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq) + + core_out = self.drop(word_emb + pos_emb[-qlen:]) + + for i, layer in enumerate(self.layers): + hids.append(core_out) + mems_i = None if mems is None else mems[i] + if mems_i is not None and i == 0: + mems_i += pos_emb[:mlen] + layer_outputs = layer(core_out, dec_attn_mask=dec_attn_mask, + mems=mems_i, head_mask=head_mask[i]) + core_out = layer_outputs[0] + if self.output_attentions: + attentions.append(layer_outputs[1]) + elif self.attn_type == 3: + core_out = self.drop(word_emb) + + for i, layer in enumerate(self.layers): + hids.append(core_out) + mems_i = None if mems is None else mems[i] + if mems_i is not None and mlen > 0: + cur_emb = self.r_emb[i][:-qlen] + cur_size = cur_emb.size(0) + if cur_size < mlen: + cur_emb_pad = cur_emb[0:1].expand(mlen-cur_size, -1, -1) + cur_emb = torch.cat([cur_emb_pad, cur_emb], 0) + else: + cur_emb = cur_emb[-mlen:] + mems_i += cur_emb.view(mlen, 1, -1) + core_out += self.r_emb[i][-qlen:].view(qlen, 1, -1) + + layer_outputs = layer(core_out, dec_attn_mask=dec_attn_mask, + mems=mems_i, head_mask=head_mask[i]) + core_out = layer_outputs[0] + if self.output_attentions: + attentions.append(layer_outputs[1]) + + core_out = self.drop(core_out) + + new_mems = self._update_mems(hids, mems, mlen, qlen) + + # We transpose back here to shape [bsz, len, hidden_dim] + outputs = [core_out.transpose(0, 1).contiguous(), new_mems] + if self.output_hidden_states: + # Add last layer and transpose to library standard shape [bsz, len, hidden_dim] + hids.append(core_out) + hids = list(t.transpose(0, 1).contiguous() for t in hids) + outputs.append(hids) + if self.output_attentions: + # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len] + attentions = list(t.permute(2, 3, 0, 1).contiguous() for t in attentions) + outputs.append(attentions) + return outputs # last hidden state, new_mems, (all hidden states), (all attentions) + + def forward(self, input_ids, mems=None, head_mask=None): + # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library + # so we transpose here from shape [bsz, len] to shape [len, bsz] + input_ids = input_ids.transpose(0, 1).contiguous() + + if mems is None: + mems = self.init_mems(input_ids) + outputs = self._forward(input_ids, mems=mems, head_mask=head_mask) + + return outputs # last hidden state, new_mems, (all hidden states), (all attentions) + + +@add_start_docstrings("""The Transformer-XL Model with a language modeling head on top + (adaptive softmax with weights tied to the adaptive input embeddings)""", + TRANSFO_XL_START_DOCSTRING, TRANSFO_XL_INPUTS_DOCSTRING) +class TransfoXLLMHeadModel(TransfoXLPreTrainedModel): + r""" + **lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. 
you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``None`` if ``lm_labels`` is provided else ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + We don't output them when the loss is computed to speedup adaptive softmax decoding. + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103') + model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + prediction_scores, mems = outputs[:2] + + """ + def __init__(self, config): + super(TransfoXLLMHeadModel, self).__init__(config) + self.transformer = TransfoXLModel(config) + self.sample_softmax = config.sample_softmax + # use sampled softmax + if config.sample_softmax > 0: + self.out_layer = nn.Linear(config.d_model, config.n_token) + self.sampler = LogUniformSampler(config.n_token, config.sample_softmax) + # use adaptive softmax (including standard softmax) + else: + self.crit = ProjectedAdaptiveLogSoftmax(config.n_token, config.d_embed, config.d_model, + config.cutoffs, div_val=config.div_val) + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ + Run this to be sure output and input (adaptive) softmax weights are tied + """ + # sampled softmax + if self.sample_softmax > 0: + if self.config.tie_weight: + self.out_layer.weight = self.transformer.word_emb.weight + # adaptive softmax (including standard softmax) + else: + if self.config.tie_weight: + for i in range(len(self.crit.out_layers)): + self._tie_or_clone_weights(self.crit.out_layers[i], + self.transformer.word_emb.emb_layers[i]) + if self.config.tie_projs: + for i, tie_proj in enumerate(self.config.tie_projs): + if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed: + if self.config.torchscript: + self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[0].clone()) + else: + self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0] + elif tie_proj and self.config.div_val != 1: + 
if self.config.torchscript: + self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[i].clone()) + else: + self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i] + + def reset_length(self, tgt_len, ext_len, mem_len): + self.transformer.reset_length(tgt_len, ext_len, mem_len) + + def init_mems(self, data): + return self.transformer.init_mems(data) + + def forward(self, input_ids, mems=None, head_mask=None, labels=None): + bsz = input_ids.size(0) + tgt_len = input_ids.size(1) + + transformer_outputs = self.transformer(input_ids, mems=mems, head_mask=head_mask) + + last_hidden = transformer_outputs[0] + pred_hid = last_hidden[:, -tgt_len:] + outputs = transformer_outputs[1:] + if self.sample_softmax > 0 and self.training: + assert self.config.tie_weight + logit = sample_logits(self.transformer.word_emb, self.out_layer.bias, labels, pred_hid, self.sampler) + softmax_output = -F.log_softmax(logit, -1)[:, :, 0] + outputs = [softmax_output] + outputs + if labels is not None: + # TODO: This is not implemented + raise NotImplementedError + else: + softmax_output = self.crit(pred_hid.view(-1, pred_hid.size(-1)), labels) + if labels is None: + softmax_output = softmax_output.view(bsz, tgt_len, -1) + outputs = [softmax_output] + outputs + else: + softmax_output = softmax_output.view(bsz, tgt_len) + outputs = [softmax_output, None] + outputs + + return outputs # (loss), logits or None if labels is not None (speed up adaptive softmax), new_mems, (all hidden states), (all attentions) diff --git a/Optimus/code/pytorch_transformers/modeling_transfo_xl_utilities.py b/Optimus/code/pytorch_transformers/modeling_transfo_xl_utilities.py new file mode 100755 index 0000000000000000000000000000000000000000..0773d0d5fca418918c50d730f1da37c1bc7f98a1 --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_transfo_xl_utilities.py @@ -0,0 +1,332 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Utilities for PyTorch Transformer XL model. + Directly adapted from https://github.com/kimiyoung/transformer-xl. 
+""" + +from collections import defaultdict + +import numpy as np + +import torch +import torch.nn as nn +import torch.nn.functional as F + +# CUDA_MAJOR = int(torch.version.cuda.split('.')[0]) +# CUDA_MINOR = int(torch.version.cuda.split('.')[1]) + +class ProjectedAdaptiveLogSoftmax(nn.Module): + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, + keep_order=False): + super(ProjectedAdaptiveLogSoftmax, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + self.d_proj = d_proj + + self.cutoffs = cutoffs + [n_token] + self.cutoff_ends = [0] + self.cutoffs + self.div_val = div_val + + self.shortlist_size = self.cutoffs[0] + self.n_clusters = len(self.cutoffs) - 1 + self.head_size = self.shortlist_size + self.n_clusters + + if self.n_clusters > 0: + self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed)) + self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters)) + + self.out_layers = nn.ModuleList() + self.out_projs = nn.ParameterList() + + if div_val == 1: + for i in range(len(self.cutoffs)): + if d_proj != d_embed: + self.out_projs.append( + nn.Parameter(torch.FloatTensor(d_proj, d_embed)) + ) + else: + self.out_projs.append(None) + + self.out_layers.append(nn.Linear(d_embed, n_token)) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i+1] + d_emb_i = d_embed // (div_val ** i) + + self.out_projs.append( + nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)) + ) + + self.out_layers.append(nn.Linear(d_emb_i, r_idx-l_idx)) + + self.keep_order = keep_order + + def _compute_logit(self, hidden, weight, bias, proj): + if proj is None: + logit = F.linear(hidden, weight, bias=bias) + else: + # if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1: + proj_hid = F.linear(hidden, proj.t().contiguous()) + logit = F.linear(proj_hid, weight, bias=bias) + # else: + # logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t())) + # if bias is not None: + # logit = logit + bias + + return logit + + def forward(self, hidden, labels=None, keep_order=False): + ''' + Params: + hidden :: [len*bsz x d_proj] + labels :: [len*bsz] + Return: + if labels is None: + out :: [len*bsz] Negative log likelihood + else: + out :: [len*bsz x n_tokens] log probabilities of tokens over the vocabulary + We could replace this implementation by the native PyTorch one + if their's had an option to set bias on all clusters in the native one. 
+ here: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138 + ''' + + if labels is not None: + labels = labels.view(-1) + if hidden.size(0) != labels.size(0): + raise RuntimeError('Input and labels should have the same size ' + 'in the batch dimension.') + + if self.n_clusters == 0: + logit = self._compute_logit(hidden, self.out_layers[0].weight, + self.out_layers[0].bias, self.out_projs[0]) + if labels is not None: + out = -F.log_softmax(logit, dim=-1) \ + .gather(1, labels.unsqueeze(1)).squeeze(1) + else: + out = F.log_softmax(logit, dim=-1) + else: + # construct weights and biases + weights, biases = [], [] + for i in range(len(self.cutoffs)): + if self.div_val == 1: + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + weight_i = self.out_layers[0].weight[l_idx:r_idx] + bias_i = self.out_layers[0].bias[l_idx:r_idx] + else: + weight_i = self.out_layers[i].weight + bias_i = self.out_layers[i].bias + + if i == 0: + weight_i = torch.cat( + [weight_i, self.cluster_weight], dim=0) + bias_i = torch.cat( + [bias_i, self.cluster_bias], dim=0) + + weights.append(weight_i) + biases.append(bias_i) + + head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0] + + head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj) + head_logprob = F.log_softmax(head_logit, dim=1) + + if labels is None: + out = hidden.new_empty((head_logit.size(0), self.n_token)) + else: + out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device) + + offset = 0 + cutoff_values = [0] + self.cutoffs + for i in range(len(cutoff_values) - 1): + l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1] + + if labels is not None: + mask_i = (labels >= l_idx) & (labels < r_idx) + indices_i = mask_i.nonzero().squeeze() + + if indices_i.numel() == 0: + continue + + target_i = labels.index_select(0, indices_i) - l_idx + head_logprob_i = head_logprob.index_select(0, indices_i) + hidden_i = hidden.index_select(0, indices_i) + else: + hidden_i = hidden + + if i == 0: + if labels is not None: + logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1) + else: + out[:, :self.cutoffs[0]] = head_logprob[:, :self.cutoffs[0]] + else: + weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i] + + tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i) + tail_logprob_i = F.log_softmax(tail_logit_i, dim=1) + cluster_prob_idx = self.cutoffs[0] + i - 1 # No probability for the head cluster + if labels is not None: + logprob_i = head_logprob_i[:, cluster_prob_idx] \ + + tail_logprob_i.gather(1, target_i[:, None]).squeeze(1) + else: + logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i + out[:, l_idx:r_idx] = logprob_i + + if labels is not None: + if (hasattr(self, 'keep_order') and self.keep_order) or keep_order: + out.index_copy_(0, indices_i, -logprob_i) + else: + out[offset:offset+logprob_i.size(0)].copy_(-logprob_i) + offset += logprob_i.size(0) + + return out + + + def log_prob(self, hidden): + r""" Computes log probabilities for all :math:`n\_classes` + From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py + Args: + hidden (Tensor): a minibatch of examples + Returns: + log-probabilities of for each class :math:`c` + in range :math:`0 <= c <= n\_classes`, where :math:`n\_classes` is a + parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor. 
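A toy instantiation may make the cluster bookkeeping in `ProjectedAdaptiveLogSoftmax.forward` above concrete. Sizes are invented, and the projection parameters are initialized by hand because (as with `AdaptiveEmbedding`) they are created as raw `FloatTensor`s:

```
import torch
import torch.nn as nn

# 1,000-token vocabulary: head cluster [0, 100) plus two tail clusters.
crit = ProjectedAdaptiveLogSoftmax(n_token=1000, d_embed=64, d_proj=64,
                                   cutoffs=[100, 400], div_val=4)
for p in crit.parameters():
    nn.init.normal_(p, 0.0, 0.02)

hidden = torch.randn(16 * 4, 64)              # [len*bsz, d_proj]
labels = torch.randint(0, 1000, (16 * 4,))    # [len*bsz]
nll = crit(hidden, labels)                    # per-token negative log-likelihood
log_probs = crit(hidden)                      # [len*bsz, 1000] log-probabilities
print(nll.shape, log_probs.shape)
```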
+ Shape: + - Input: :math:`(N, in\_features)` + - Output: :math:`(N, n\_classes)` + """ + if self.n_clusters == 0: + logit = self._compute_logit(hidden, self.out_layers[0].weight, + self.out_layers[0].bias, self.out_projs[0]) + return F.log_softmax(logit, dim=-1) + else: + # construct weights and biases + weights, biases = [], [] + for i in range(len(self.cutoffs)): + if self.div_val == 1: + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + weight_i = self.out_layers[0].weight[l_idx:r_idx] + bias_i = self.out_layers[0].bias[l_idx:r_idx] + else: + weight_i = self.out_layers[i].weight + bias_i = self.out_layers[i].bias + + if i == 0: + weight_i = torch.cat( + [weight_i, self.cluster_weight], dim=0) + bias_i = torch.cat( + [bias_i, self.cluster_bias], dim=0) + + weights.append(weight_i) + biases.append(bias_i) + + head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0] + head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj) + + out = hidden.new_empty((head_logit.size(0), self.n_token)) + head_logprob = F.log_softmax(head_logit, dim=1) + + cutoff_values = [0] + self.cutoffs + for i in range(len(cutoff_values) - 1): + start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1] + + if i == 0: + out[:, :self.cutoffs[0]] = head_logprob[:, :self.cutoffs[0]] + else: + weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i] + + tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i) + tail_logprob_i = F.log_softmax(tail_logit_i, dim=1) + + logprob_i = head_logprob[:, -i] + tail_logprob_i + out[:, start_idx, stop_idx] = logprob_i + + return out + + +class LogUniformSampler(object): + def __init__(self, range_max, n_sample): + """ + Reference : https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/python/ops/candidate_sampling_ops.py + `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)` + + expected count can be approximated by 1 - (1 - p)^n + and we use a numerically stable version -expm1(num_tries * log1p(-p)) + + Our implementation fixes num_tries at 2 * n_sample, and the actual #samples will vary from run to run + """ + with torch.no_grad(): + self.range_max = range_max + log_indices = torch.arange(1., range_max+2., 1.).log_() + self.dist = (log_indices[1:] - log_indices[:-1]) / log_indices[-1] + + self.log_q = (- (-self.dist.double().log1p_() * 2 * n_sample).expm1_()).log_().float() + + self.n_sample = n_sample + + def sample(self, labels): + """ + labels: [b1, b2] + Return + true_log_probs: [b1, b2] + samp_log_probs: [n_sample] + neg_samples: [n_sample] + """ + + # neg_samples = torch.empty(0).long() + n_sample = self.n_sample + n_tries = 2 * n_sample + + with torch.no_grad(): + neg_samples = torch.multinomial(self.dist, n_tries, replacement=True).unique() + device = labels.device + neg_samples = neg_samples.to(device) + true_log_probs = self.log_q[labels].to(device) + samp_log_probs = self.log_q[neg_samples].to(device) + return true_log_probs, samp_log_probs, neg_samples + +def sample_logits(embedding, bias, labels, inputs, sampler): + """ + embedding: an nn.Embedding layer + bias: [n_vocab] + labels: [b1, b2] + inputs: [b1, b2, n_emb] + sampler: you may use a LogUniformSampler + Return + logits: [b1, b2, 1 + n_sample] + """ + true_log_probs, samp_log_probs, neg_samples = sampler.sample(labels) + n_sample = neg_samples.size(0) + b1, b2 = labels.size(0), labels.size(1) + all_ids = torch.cat([labels.view(-1), neg_samples]) + all_w = embedding(all_ids) + true_w = all_w[: -n_sample].view(b1, b2, 
-1) + sample_w = all_w[- n_sample:].view(n_sample, -1) + + all_b = bias[all_ids] + true_b = all_b[: -n_sample].view(b1, b2) + sample_b = all_b[- n_sample:] + + hit = (labels[:, :, None] == neg_samples).detach() + + true_logits = torch.einsum('ijk,ijk->ij', + [true_w, inputs]) + true_b - true_log_probs + sample_logits = torch.einsum('lk,ijk->ijl', + [sample_w, inputs]) + sample_b - samp_log_probs + sample_logits.masked_fill_(hit, -1e30) + logits = torch.cat([true_logits[:, :, None], sample_logits], -1) + + return logits diff --git a/Optimus/code/pytorch_transformers/modeling_utils.py b/Optimus/code/pytorch_transformers/modeling_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..cb28ebe17bf57c0cb36f166e04874dbaa4b1e6ea --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_utils.py @@ -0,0 +1,780 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch BERT model.""" + +from __future__ import (absolute_import, division, print_function, + unicode_literals) + + +import pdb +import copy +import json +import logging +import os +from io import open + +import six +import torch +from torch import nn +from torch.nn import CrossEntropyLoss +from torch.nn import functional as F + +from .configuration_utils import PretrainedConfig +from .file_utils import cached_path, WEIGHTS_NAME, TF_WEIGHTS_NAME + +logger = logging.getLogger(__name__) + + +try: + from torch.nn import Identity +except ImportError: + # Older PyTorch compatibility + class Identity(nn.Module): + r"""A placeholder identity operator that is argument-insensitive. + """ + def __init__(self, *args, **kwargs): + super(Identity, self).__init__() + + def forward(self, input): + return input + +class PreTrainedModel(nn.Module): + r""" Base class for all models. + + :class:`~pytorch_transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models + as well as a few methods commons to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads. + + Class attributes (overridden by derived classes): + - ``config_class``: a class derived from :class:`~pytorch_transformers.PretrainedConfig` to use as configuration class for this model architecture. + - ``pretrained_model_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained weights as values. + - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments: + + - ``model``: an instance of the relevant subclass of :class:`~pytorch_transformers.PreTrainedModel`, + - ``config``: an instance of the relevant subclass of :class:`~pytorch_transformers.PretrainedConfig`, + - ``path``: a path (string) to the TensorFlow checkpoint. 
+ + - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model. + """ + config_class = None + pretrained_model_archive_map = {} + load_tf_weights = lambda model, config, path: None + base_model_prefix = "" + + def __init__(self, config, *inputs, **kwargs): + super(PreTrainedModel, self).__init__() + if not isinstance(config, PretrainedConfig): + raise ValueError( + "Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. " + "To create a model from a pretrained model use " + "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format( + self.__class__.__name__, self.__class__.__name__ + )) + # Save config in model + self.config = config + + def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None): + """ Build a resized Embedding Module from a provided token Embedding Module. + Increasing the size will add newly initialized vectors at the end + Reducing the size will remove vectors from the end + + Args: + new_num_tokens: (`optional`) int + New number of tokens in the embedding matrix. + Increasing the size will add newly initialized vectors at the end + Reducing the size will remove vectors from the end + If not provided or None: return the provided token Embedding Module. + Return: ``torch.nn.Embeddings`` + Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None + """ + if new_num_tokens is None: + return old_embeddings + + old_num_tokens, old_embedding_dim = old_embeddings.weight.size() + if old_num_tokens == new_num_tokens: + return old_embeddings + + # Build new embeddings + new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim) + new_embeddings.to(old_embeddings.weight.device) + + # initialize all new embeddings (in particular added tokens) + self._init_weights(new_embeddings) + + # Copy word embeddings from the previous weights + num_tokens_to_copy = min(old_num_tokens, new_num_tokens) + new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :] + + return new_embeddings + + def _tie_or_clone_weights(self, first_module, second_module): + """ Tie or clone module weights depending of weither we are using TorchScript or not + """ + if self.config.torchscript: + first_module.weight = nn.Parameter(second_module.weight.clone()) + else: + first_module.weight = second_module.weight + + if hasattr(first_module, 'bias') and first_module.bias is not None: + first_module.bias.data = torch.nn.functional.pad( + first_module.bias.data, + (0, first_module.weight.shape[0] - first_module.bias.shape[0]), + 'constant', + 0 + ) + + def resize_token_embeddings(self, new_num_tokens=None): + """ Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size. + Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method. + + Arguments: + + new_num_tokens: (`optional`) int: + New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end. + If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model. 
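A usage sketch for `resize_token_embeddings`; the GPT-2 tokenizer/model and the added `[PAD]` token are only example choices, and any class built on `PreTrainedModel` works the same way:

```
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Register a new special token, then grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))   # new rows are freshly initialized
```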
+ + Return: ``torch.nn.Embeddings`` + Pointer to the input tokens Embeddings Module of the model + """ + + + base_model = getattr(self, self.base_model_prefix, self) # get the base model if needed + + model_embeds = base_model._resize_token_embeddings(new_num_tokens) + if new_num_tokens is None: + return model_embeds + + # Update base model and current model config + self.config.vocab_size = new_num_tokens + base_model.vocab_size = new_num_tokens + + # Tie weights again if needed + if hasattr(self, 'tie_weights'): + self.tie_weights() + + return model_embeds + + def init_weights(self): + """ Initialize and prunes weights if needed. """ + # Initialize weights + self.apply(self._init_weights) + + # Prune heads if needed + if self.config.pruned_heads: + self.prune_heads(self.config.pruned_heads) + + def prune_heads(self, heads_to_prune): + """ Prunes heads of the base model. + + Arguments: + + heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`). + E.g. {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2. + """ + base_model = getattr(self, self.base_model_prefix, self) # get the base model if needed + + # save new sets of pruned heads as union of previously stored pruned heads and newly pruned heads + for layer, heads in heads_to_prune.items(): + union_heads = set(self.config.pruned_heads.get(layer, [])) | set(heads) + self.config.pruned_heads[layer] = list(union_heads) # Unfortunately we have to store it as list for JSON + + base_model._prune_heads(heads_to_prune) + + def save_pretrained(self, save_directory): + """ Save a model and its configuration file to a directory, so that it + can be re-loaded using the `:func:`~pytorch_transformers.PreTrainedModel.from_pretrained`` class method. + """ + assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved" + + # Only save the model it-self if we are using distributed training + model_to_save = self.module if hasattr(self, 'module') else self + + # Save configuration file + model_to_save.config.save_pretrained(save_directory) + + # If we save using the predefined names, we can load using `from_pretrained` + output_model_file = os.path.join(save_directory, WEIGHTS_NAME) + + torch.save(model_to_save.state_dict(), output_model_file) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + r"""Instantiate a pretrained pytorch model from a pre-trained model configuration. + + The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated) + To train the model, you should first set it back in training mode with ``model.train()`` + + The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model. + It is up to you to train those weights with a downstream fine-tuning task. + + The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded. + + Parameters: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. 
+ - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards. + + model_args: (`optional`) Sequence of positional arguments: + All remaning positional arguments will be passed to the underlying model's ``__init__`` method + + config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`: + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: + + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. + - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. + + state_dict: (`optional`) dict: + an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file. + This option can be used if you want to create a model from a pretrained configuration but load your own weights. + In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded pre-trained model + configuration should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the model weights and configuration files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + output_loading_info: (`optional`) boolean: + Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages. + + kwargs: (`optional`) Remaining dictionary of keyword arguments: + Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded: + + - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done) + - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function. + + Examples:: + + model = BertModel.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache. 
+ model = BertModel.from_pretrained('./test/saved_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` + model = BertModel.from_pretrained('bert-base-uncased', output_attention=True) # Update configuration during loading + assert model.config.output_attention == True + # Loading from a TF checkpoint file instead of a PyTorch model (slower) + config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json') + model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config) + + """ + config = kwargs.pop('config', None) + state_dict = kwargs.pop('state_dict', None) + cache_dir = kwargs.pop('cache_dir', None) + from_tf = kwargs.pop('from_tf', False) + force_download = kwargs.pop('force_download', False) + proxies = kwargs.pop('proxies', None) + output_loading_info = kwargs.pop('output_loading_info', False) + + # Load config + if config is None: + config, model_kwargs = cls.config_class.from_pretrained( + pretrained_model_name_or_path, *model_args, + cache_dir=cache_dir, return_unused_kwargs=True, + force_download=force_download, + **kwargs + ) + else: + model_kwargs = kwargs + + # Load model + if pretrained_model_name_or_path in cls.pretrained_model_archive_map: + archive_file = cls.pretrained_model_archive_map[pretrained_model_name_or_path] + elif os.path.isdir(pretrained_model_name_or_path): + if from_tf: + # Directly load from a TensorFlow checkpoint + archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + ".index") + else: + archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME) + else: + if from_tf: + # Directly load from a TensorFlow checkpoint + archive_file = pretrained_model_name_or_path + ".index" + else: + archive_file = pretrained_model_name_or_path + # redirect to the cache, if necessary + try: + resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies) + except EnvironmentError as e: + if pretrained_model_name_or_path in cls.pretrained_model_archive_map: + logger.error( + "Couldn't reach server at '{}' to download pretrained weights.".format( + archive_file)) + else: + logger.error( + "Model name '{}' was not found in model name list ({}). " + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(cls.pretrained_model_archive_map.keys()), + archive_file)) + raise e + if resolved_archive_file == archive_file: + logger.info("loading weights file {}".format(archive_file)) + else: + logger.info("loading weights file {} from cache at {}".format( + archive_file, resolved_archive_file)) + + # Instantiate model. 
+ model = cls(config, *model_args, **model_kwargs) + + if state_dict is None and not from_tf: + state_dict = torch.load(resolved_archive_file, map_location='cpu') + if from_tf: + # Directly load from a TensorFlow checkpoint + return cls.load_tf_weights(model, config, resolved_archive_file[:-6]) # Remove the '.index' + + # Convert old format to new format if needed from a PyTorch state_dict + old_keys = [] + new_keys = [] + for key in state_dict.keys(): + new_key = None + if 'gamma' in key: + new_key = key.replace('gamma', 'weight') + if 'beta' in key: + new_key = key.replace('beta', 'bias') + if new_key: + old_keys.append(key) + new_keys.append(new_key) + for old_key, new_key in zip(old_keys, new_keys): + state_dict[new_key] = state_dict.pop(old_key) + + # Load from a PyTorch state_dict + missing_keys = [] + unexpected_keys = [] + error_msgs = [] + # copy state_dict so _load_from_state_dict can modify it + metadata = getattr(state_dict, '_metadata', None) + state_dict = state_dict.copy() + if metadata is not None: + state_dict._metadata = metadata + + def load(module, prefix=''): + local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {}) + module._load_from_state_dict( + state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs) + for name, child in module._modules.items(): + if child is not None: + load(child, prefix + name + '.') + + # Make sure we are able to load base models as well as derived models (with heads) + start_prefix = '' + model_to_load = model + if not hasattr(model, cls.base_model_prefix) and any(s.startswith(cls.base_model_prefix) for s in state_dict.keys()): + start_prefix = cls.base_model_prefix + '.' + if hasattr(model, cls.base_model_prefix) and not any(s.startswith(cls.base_model_prefix) for s in state_dict.keys()): + model_to_load = getattr(model, cls.base_model_prefix) + + load(model_to_load, prefix=start_prefix) + if len(missing_keys) > 0: + logger.info("Weights of {} not initialized from pretrained model: {}".format( + model.__class__.__name__, missing_keys)) + if len(unexpected_keys) > 0: + logger.info("Weights from pretrained model not used in {}: {}".format( + model.__class__.__name__, unexpected_keys)) + if len(error_msgs) > 0: + raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( + model.__class__.__name__, "\n\t".join(error_msgs))) + + if hasattr(model, 'tie_weights'): + model.tie_weights() # make sure word embedding weights are still tied + + # Set model in evaluation mode to desactivate DropOut modules by default + model.eval() + + if output_loading_info: + loading_info = {"missing_keys": missing_keys, "unexpected_keys": unexpected_keys, "error_msgs": error_msgs} + return model, loading_info + + return model + + +class Conv1D(nn.Module): + def __init__(self, nf, nx): + """ Conv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2) + Basically works like a Linear layer but the weights are transposed + """ + super(Conv1D, self).__init__() + self.nf = nf + w = torch.empty(nx, nf) + nn.init.normal_(w, std=0.02) + self.weight = nn.Parameter(w) + self.bias = nn.Parameter(torch.zeros(nf)) + + def forward(self, x): + size_out = x.size()[:-1] + (self.nf,) + x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight) + x = x.view(*size_out) + return x + + +class PoolerStartLogits(nn.Module): + """ Compute SQuAD start_logits from sequence hidden states. 
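A quick sanity check of the `Conv1D` layer above: it behaves like `nn.Linear` on the last dimension, but stores its weight transposed, as `(nx, nf)`.

```
import torch

conv = Conv1D(nf=12, nx=4)          # maps (..., 4) -> (..., 12)
x = torch.randn(2, 7, 4)
print(conv(x).shape)                # torch.Size([2, 7, 12])
print(conv.weight.shape)            # torch.Size([4, 12]) -- transposed vs. nn.Linear
```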
""" + def __init__(self, config): + super(PoolerStartLogits, self).__init__() + self.dense = nn.Linear(config.hidden_size, 1) + + def forward(self, hidden_states, p_mask=None): + """ Args: + **p_mask**: (`optional`) ``torch.FloatTensor`` of shape `(batch_size, seq_len)` + invalid position mask such as query and special symbols (PAD, SEP, CLS) + 1.0 means token should be masked. + """ + x = self.dense(hidden_states).squeeze(-1) + + if p_mask is not None: + if next(self.parameters()).dtype == torch.float16: + x = x * (1 - p_mask) - 65500 * p_mask + else: + x = x * (1 - p_mask) - 1e30 * p_mask + + return x + + +class PoolerEndLogits(nn.Module): + """ Compute SQuAD end_logits from sequence hidden states and start token hidden state. + """ + def __init__(self, config): + super(PoolerEndLogits, self).__init__() + self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size) + self.activation = nn.Tanh() + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dense_1 = nn.Linear(config.hidden_size, 1) + + def forward(self, hidden_states, start_states=None, start_positions=None, p_mask=None): + """ Args: + One of ``start_states``, ``start_positions`` should be not None. + If both are set, ``start_positions`` overrides ``start_states``. + + **start_states**: ``torch.LongTensor`` of shape identical to hidden_states + hidden states of the first tokens for the labeled span. + **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)`` + position of the first token for the labeled span: + **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)`` + Mask of invalid position such as query and special symbols (PAD, SEP, CLS) + 1.0 means token should be masked. + """ + assert start_states is not None or start_positions is not None, "One of start_states, start_positions should be not None" + if start_positions is not None: + slen, hsz = hidden_states.shape[-2:] + start_positions = start_positions[:, None, None].expand(-1, -1, hsz) # shape (bsz, 1, hsz) + start_states = hidden_states.gather(-2, start_positions) # shape (bsz, 1, hsz) + start_states = start_states.expand(-1, slen, -1) # shape (bsz, slen, hsz) + + x = self.dense_0(torch.cat([hidden_states, start_states], dim=-1)) + x = self.activation(x) + x = self.LayerNorm(x) + x = self.dense_1(x).squeeze(-1) + + if p_mask is not None: + x = x * (1 - p_mask) - 1e30 * p_mask + + return x + + +class PoolerAnswerClass(nn.Module): + """ Compute SQuAD 2.0 answer class from classification and start tokens hidden states. """ + def __init__(self, config): + super(PoolerAnswerClass, self).__init__() + self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size) + self.activation = nn.Tanh() + self.dense_1 = nn.Linear(config.hidden_size, 1, bias=False) + + def forward(self, hidden_states, start_states=None, start_positions=None, cls_index=None): + """ + Args: + One of ``start_states``, ``start_positions`` should be not None. + If both are set, ``start_positions`` overrides ``start_states``. + + **start_states**: ``torch.LongTensor`` of shape identical to ``hidden_states``. + hidden states of the first tokens for the labeled span. + **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)`` + position of the first token for the labeled span. + **cls_index**: torch.LongTensor of shape ``(batch_size,)`` + position of the CLS token. If None, take the last token. 
+ + note(Original repo): + no dependency on end_feature so that we can obtain one single `cls_logits` + for each sample + """ + hsz = hidden_states.shape[-1] + assert start_states is not None or start_positions is not None, "One of start_states, start_positions should be not None" + if start_positions is not None: + start_positions = start_positions[:, None, None].expand(-1, -1, hsz) # shape (bsz, 1, hsz) + start_states = hidden_states.gather(-2, start_positions).squeeze(-2) # shape (bsz, hsz) + + if cls_index is not None: + cls_index = cls_index[:, None, None].expand(-1, -1, hsz) # shape (bsz, 1, hsz) + cls_token_state = hidden_states.gather(-2, cls_index).squeeze(-2) # shape (bsz, hsz) + else: + cls_token_state = hidden_states[:, -1, :] # shape (bsz, hsz) + + x = self.dense_0(torch.cat([start_states, cls_token_state], dim=-1)) + x = self.activation(x) + x = self.dense_1(x).squeeze(-1) + + return x + + +class SQuADHead(nn.Module): + r""" A SQuAD head inspired by XLNet. + + Parameters: + config (:class:`~pytorch_transformers.XLNetConfig`): Model configuration class with all the parameters of the model. + + Inputs: + **hidden_states**: ``torch.FloatTensor`` of shape ``(batch_size, seq_len, hidden_size)`` + hidden states of sequence tokens + **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)`` + position of the first token for the labeled span. + **end_positions**: ``torch.LongTensor`` of shape ``(batch_size,)`` + position of the last token for the labeled span. + **cls_index**: torch.LongTensor of shape ``(batch_size,)`` + position of the CLS token. If None, take the last token. + **is_impossible**: ``torch.LongTensor`` of shape ``(batch_size,)`` + Whether the question has a possible answer in the paragraph or not. + **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)`` + Mask of invalid position such as query and special symbols (PAD, SEP, CLS) + 1.0 means token should be masked. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses. + **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)`` + Log probabilities for the top config.start_n_top start token possibilities (beam-search). + **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)`` + Indices for the top config.start_n_top start token possibilities (beam-search). + **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)`` + Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search). + **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)`` + Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search). 
+ **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size,)`` + Log probabilities for the ``is_impossible`` label of the answers. + """ + def __init__(self, config): + super(SQuADHead, self).__init__() + self.start_n_top = config.start_n_top + self.end_n_top = config.end_n_top + + self.start_logits = PoolerStartLogits(config) + self.end_logits = PoolerEndLogits(config) + self.answer_class = PoolerAnswerClass(config) + + def forward(self, hidden_states, start_positions=None, end_positions=None, + cls_index=None, is_impossible=None, p_mask=None): + outputs = () + + start_logits = self.start_logits(hidden_states, p_mask=p_mask) + + if start_positions is not None and end_positions is not None: + # If we are on multi-GPU, let's remove the dimension added by batch splitting + for x in (start_positions, end_positions, cls_index, is_impossible): + if x is not None and x.dim() > 1: + x.squeeze_(-1) + + # during training, compute the end logits based on the ground truth of the start position + end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask) + + loss_fct = CrossEntropyLoss() + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + if cls_index is not None and is_impossible is not None: + # Predict answerability from the representation of CLS and START + cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index) + loss_fct_cls = nn.BCEWithLogitsLoss() + cls_loss = loss_fct_cls(cls_logits, is_impossible) + + # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss + total_loss += cls_loss * 0.5 + + outputs = (total_loss,) + outputs + + else: + # during inference, compute the end logits based on beam search + bsz, slen, hsz = hidden_states.size() + start_log_probs = F.softmax(start_logits, dim=-1) # shape (bsz, slen) + + start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1) # shape (bsz, start_n_top) + start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz) # shape (bsz, start_n_top, hsz) + start_states = torch.gather(hidden_states, -2, start_top_index_exp) # shape (bsz, start_n_top, hsz) + start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1) # shape (bsz, slen, start_n_top, hsz) + + hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states) # shape (bsz, slen, start_n_top, hsz) + p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None + end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask) + end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top) + + end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top) + end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top) + end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top) + + start_states = torch.einsum("blh,bl->bh", hidden_states, start_log_probs) + cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index) + + outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs + + # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits + # or (if labels are provided) 
(total_loss,) + return outputs + + +class SequenceSummary(nn.Module): + r""" Compute a single vector summary of a sequence hidden states according to various possibilities: + Args of the config class: + summary_type: + - 'last' => [default] take the last token hidden state (like XLNet) + - 'first' => take the first token hidden state (like Bert) + - 'mean' => take the mean of all tokens hidden states + - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) + - 'attn' => Not implemented now, use multi-head attention + summary_use_proj: Add a projection after the vector extraction + summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False. + summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default + summary_first_dropout: Add a dropout before the projection and activation + summary_last_dropout: Add a dropout after the projection and activation + """ + def __init__(self, config): + super(SequenceSummary, self).__init__() + + self.summary_type = config.summary_type if hasattr(config, 'summary_use_proj') else 'last' + if self.summary_type == 'attn': + # We should use a standard multi-head attention module with absolute positional embedding for that. + # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276 + # We can probably just use the multi-head attention module of PyTorch >=1.1.0 + raise NotImplementedError + + self.summary = Identity() + if hasattr(config, 'summary_use_proj') and config.summary_use_proj: + if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0: + num_classes = config.num_labels + else: + num_classes = config.hidden_size + self.summary = nn.Linear(config.hidden_size, num_classes) + + self.activation = Identity() + if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh': + self.activation = nn.Tanh() + + self.first_dropout = Identity() + if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0: + self.first_dropout = nn.Dropout(config.summary_first_dropout) + + self.last_dropout = Identity() + if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0: + self.last_dropout = nn.Dropout(config.summary_last_dropout) + + def forward(self, hidden_states, cls_index=None): + """ hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer. + cls_index: [optional] position of the classification token if summary_type == 'cls_index', + shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states. 
+ if summary_type == 'cls_index' and cls_index is None: + we take the last token of the sequence as classification token + """ + if self.summary_type == 'last': + output = hidden_states[:, -1] + elif self.summary_type == 'first': + output = hidden_states[:, 0] + elif self.summary_type == 'mean': + output = hidden_states.mean(dim=1) + elif self.summary_type == 'cls_index': + if cls_index is None: + cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2]-1, dtype=torch.long) + else: + cls_index = cls_index.unsqueeze(-1).unsqueeze(-1) + cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),)) + # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states + output = hidden_states.gather(-2, cls_index).squeeze(-2) # shape (bsz, XX, hidden_size) + elif self.summary_type == 'attn': + raise NotImplementedError + + output = self.first_dropout(output) + output = self.summary(output) + output = self.activation(output) + output = self.last_dropout(output) + + return output + + +def prune_linear_layer(layer, index, dim=0): + """ Prune a linear layer (a model parameters) to keep only entries in index. + Return the pruned layer as a new layer with requires_grad=True. + Used to remove heads. + """ + index = index.to(layer.weight.device) + W = layer.weight.index_select(dim, index).clone().detach() + if layer.bias is not None: + if dim == 1: + b = layer.bias.clone().detach() + else: + b = layer.bias[index].clone().detach() + new_size = list(layer.weight.size()) + new_size[dim] = len(index) + new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device) + new_layer.weight.requires_grad = False + new_layer.weight.copy_(W.contiguous()) + new_layer.weight.requires_grad = True + if layer.bias is not None: + new_layer.bias.requires_grad = False + new_layer.bias.copy_(b.contiguous()) + new_layer.bias.requires_grad = True + return new_layer + + +def prune_conv1d_layer(layer, index, dim=1): + """ Prune a Conv1D layer (a model parameters) to keep only entries in index. + A Conv1D work as a Linear layer (see e.g. BERT) but the weights are transposed. + Return the pruned layer as a new layer with requires_grad=True. + Used to remove heads. + """ + index = index.to(layer.weight.device) + W = layer.weight.index_select(dim, index).clone().detach() + if dim == 0: + b = layer.bias.clone().detach() + else: + b = layer.bias[index].clone().detach() + new_size = list(layer.weight.size()) + new_size[dim] = len(index) + new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device) + new_layer.weight.requires_grad = False + new_layer.weight.copy_(W.contiguous()) + new_layer.weight.requires_grad = True + new_layer.bias.requires_grad = False + new_layer.bias.copy_(b.contiguous()) + new_layer.bias.requires_grad = True + return new_layer + + +def prune_layer(layer, index, dim=None): + """ Prune a Conv1D or nn.Linear layer (a model parameters) to keep only entries in index. + Return the pruned layer as a new layer with requires_grad=True. + Used to remove heads. 
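+
+        Example of pruning away the first 64 output units of a linear layer (an
+        illustrative sketch; the sizes are arbitrary placeholders)::
+
+            layer = nn.Linear(768, 768)
+            keep = torch.arange(64, 768)       # indices of the output units to keep
+            pruned = prune_layer(layer, keep)  # -> nn.Linear(in_features=768, out_features=704)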
+ """ + if isinstance(layer, nn.Linear): + return prune_linear_layer(layer, index, dim=0 if dim is None else dim) + elif isinstance(layer, Conv1D): + return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim) + else: + raise ValueError("Can't prune layer of class {}".format(layer.__class__)) diff --git a/Optimus/code/pytorch_transformers/modeling_xlm.py b/Optimus/code/pytorch_transformers/modeling_xlm.py new file mode 100755 index 0000000000000000000000000000000000000000..67866a30ddcf7449a22f3a7ab32a9d8a0d2d6071 --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_xlm.py @@ -0,0 +1,796 @@ +# coding=utf-8 +# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch XLM model. +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging +import math + +import itertools +import numpy as np + +import torch +from torch import nn +from torch.nn import functional as F +from torch.nn import CrossEntropyLoss, MSELoss + +from .modeling_utils import PreTrainedModel, prune_linear_layer, SequenceSummary, SQuADHead +from .configuration_xlm import XLMConfig +from .file_utils import add_start_docstrings + +logger = logging.getLogger(__name__) + +XLM_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'xlm-mlm-en-2048': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-pytorch_model.bin", + 'xlm-mlm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-pytorch_model.bin", + 'xlm-mlm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-pytorch_model.bin", + 'xlm-mlm-enro-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-pytorch_model.bin", + 'xlm-mlm-tlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-pytorch_model.bin", + 'xlm-mlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-pytorch_model.bin", + 'xlm-clm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-pytorch_model.bin", + 'xlm-clm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-pytorch_model.bin", + 'xlm-mlm-17-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-pytorch_model.bin", + 'xlm-mlm-100-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-pytorch_model.bin", +} + + +def create_sinusoidal_embeddings(n_pos, dim, out): + position_enc = np.array([ + [pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] + for pos in range(n_pos) + ]) + out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2])) + out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2])) + out.detach_() + out.requires_grad = False + + +def gelu(x): + """ + GELU activation + https://arxiv.org/abs/1606.08415 + https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/model_pytorch.py#L14 + 
https://github.com/huggingface/pytorch-transformers/blob/master/modeling.py + """ + # return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0))) + + +def get_masks(slen, lengths, causal, padding_mask=None): + """ + Generate hidden states mask, and optionally an attention mask. + """ + bs = lengths.size(0) + if padding_mask is not None: + mask = padding_mask + else: + assert lengths.max().item() <= slen + alen = torch.arange(slen, dtype=torch.long, device=lengths.device) + mask = alen < lengths[:, None] + + # attention mask is the same as mask, or triangular inferior attention (causal) + if causal: + attn_mask = alen[None, None, :].repeat(bs, slen, 1) <= alen[None, :, None] + else: + attn_mask = mask + + # sanity check + assert mask.size() == (bs, slen) + assert causal is False or attn_mask.size() == (bs, slen, slen) + + return mask, attn_mask + + +class MultiHeadAttention(nn.Module): + + NEW_ID = itertools.count() + + def __init__(self, n_heads, dim, config): + super(MultiHeadAttention, self).__init__() + self.layer_id = next(MultiHeadAttention.NEW_ID) + self.output_attentions = config.output_attentions + self.dim = dim + self.n_heads = n_heads + self.dropout = config.attention_dropout + assert self.dim % self.n_heads == 0 + + self.q_lin = nn.Linear(dim, dim) + self.k_lin = nn.Linear(dim, dim) + self.v_lin = nn.Linear(dim, dim) + self.out_lin = nn.Linear(dim, dim) + self.pruned_heads = set() + + def prune_heads(self, heads): + attention_head_size = self.dim // self.n_heads + if len(heads) == 0: + return + mask = torch.ones(self.n_heads, attention_head_size) + heads = set(heads) - self.pruned_heads + for head in heads: + head -= sum(1 if h < head else 0 for h in self.pruned_heads) + mask[head] = 0 + mask = mask.view(-1).contiguous().eq(1) + index = torch.arange(len(mask))[mask].long() + # Prune linear layers + self.q_lin = prune_linear_layer(self.q_lin, index) + self.k_lin = prune_linear_layer(self.k_lin, index) + self.v_lin = prune_linear_layer(self.v_lin, index) + self.out_lin = prune_linear_layer(self.out_lin, index, dim=1) + # Update hyper params + self.n_heads = self.n_heads - len(heads) + self.dim = attention_head_size * self.n_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward(self, input, mask, kv=None, cache=None, head_mask=None): + """ + Self-attention (if kv is None) or attention over source sentence (provided by kv). 
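+        A minimal self-attention sketch (sizes are placeholders; `config` only needs the
+        `attention_dropout` and `output_attentions` attributes used by this module)::
+
+            attn = MultiHeadAttention(n_heads=8, dim=512, config=config)
+            x = torch.randn(2, 10, 512)    # (bs, qlen, dim)
+            mask = torch.ones(2, 10)       # every position is visible
+            context = attn(x, mask)[0]     # (2, 10, 512)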
+ """ + # Input is (bs, qlen, dim) + # Mask is (bs, klen) (non-causal) or (bs, klen, klen) + bs, qlen, dim = input.size() + if kv is None: + klen = qlen if cache is None else cache['slen'] + qlen + else: + klen = kv.size(1) + # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim) + n_heads = self.n_heads + dim_per_head = self.dim // n_heads + mask_reshape = (bs, 1, qlen, klen) if mask.dim() == 3 else (bs, 1, 1, klen) + + def shape(x): + """ projection """ + return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2) + + def unshape(x): + """ compute context """ + return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head) + + q = shape(self.q_lin(input)) # (bs, n_heads, qlen, dim_per_head) + if kv is None: + k = shape(self.k_lin(input)) # (bs, n_heads, qlen, dim_per_head) + v = shape(self.v_lin(input)) # (bs, n_heads, qlen, dim_per_head) + elif cache is None or self.layer_id not in cache: + k = v = kv + k = shape(self.k_lin(k)) # (bs, n_heads, qlen, dim_per_head) + v = shape(self.v_lin(v)) # (bs, n_heads, qlen, dim_per_head) + + if cache is not None: + if self.layer_id in cache: + if kv is None: + k_, v_ = cache[self.layer_id] + k = torch.cat([k_, k], dim=2) # (bs, n_heads, klen, dim_per_head) + v = torch.cat([v_, v], dim=2) # (bs, n_heads, klen, dim_per_head) + else: + k, v = cache[self.layer_id] + cache[self.layer_id] = (k, v) + + q = q / math.sqrt(dim_per_head) # (bs, n_heads, qlen, dim_per_head) + scores = torch.matmul(q, k.transpose(2, 3)) # (bs, n_heads, qlen, klen) + mask = (mask == 0).view(mask_reshape).expand_as(scores) # (bs, n_heads, qlen, klen) + scores.masked_fill_(mask, -float('inf')) # (bs, n_heads, qlen, klen) + + weights = F.softmax(scores.float(), dim=-1).type_as(scores) # (bs, n_heads, qlen, klen) + weights = F.dropout(weights, p=self.dropout, training=self.training) # (bs, n_heads, qlen, klen) + + # Mask heads if we want to + if head_mask is not None: + weights = weights * head_mask + + context = torch.matmul(weights, v) # (bs, n_heads, qlen, dim_per_head) + context = unshape(context) # (bs, qlen, dim) + + outputs = (self.out_lin(context),) + if self.output_attentions: + outputs = outputs + (weights,) + return outputs + + +class TransformerFFN(nn.Module): + + def __init__(self, in_dim, dim_hidden, out_dim, config): + super(TransformerFFN, self).__init__() + self.dropout = config.dropout + self.lin1 = nn.Linear(in_dim, dim_hidden) + self.lin2 = nn.Linear(dim_hidden, out_dim) + self.act = gelu if config.gelu_activation else F.relu + + def forward(self, input): + x = self.lin1(input) + x = self.act(x) + x = self.lin2(x) + x = F.dropout(x, p=self.dropout, training=self.training) + return x + + +class XLMPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = XLMConfig + pretrained_model_archive_map = XLM_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = None + base_model_prefix = "transformer" + + def __init__(self, *inputs, **kwargs): + super(XLMPreTrainedModel, self).__init__(*inputs, **kwargs) + + def _init_weights(self, module): + """ Initialize the weights. 
""" + if isinstance(module, nn.Embedding): + if self.config is not None and self.config.embed_init_std is not None: + nn.init.normal_(module.weight, mean=0, std=self.config.embed_init_std) + if isinstance(module, nn.Linear): + if self.config is not None and self.config.init_std is not None: + nn.init.normal_(module.weight, mean=0, std=self.config.init_std) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, 0.) + if isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +XLM_START_DOCSTRING = r""" The XLM model was proposed in + `Cross-lingual Language Model Pretraining`_ + by Guillaume Lample*, Alexis Conneau*. It's a transformer pre-trained using one of the following objectives: + + - a causal language modeling (CLM) objective (next token prediction), + - a masked language modeling (MLM) objective (Bert-like), or + - a Translation Language Modeling (TLM) object (extension of Bert's MLM to multiple language inputs) + + Original code can be found `here`_. + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`Cross-lingual Language Model Pretraining`: + https://arxiv.org/abs/1901.07291 + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + .. _`here`: + https://github.com/facebookresearch/XLM + + Parameters: + config (:class:`~pytorch_transformers.XLMConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. + Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +XLM_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + + XLM is a model with absolute position embeddings so it's usually advised to pad the inputs on + the right rather than the left. + + Indices can be obtained using :class:`pytorch_transformers.XLMTokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **langs**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens to be used to indicate the language of each token in the input. + Indices are languages ids which can be obtained from the language names by using two conversion mappings + provided in the configuration of the model (only provided for multilingual models). + More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and + the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str). + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens (can be used to indicate various portions of the inputs). + The embeddings from these tokens will be summed with the respective token embeddings. 
+ Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices). + **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, config.max_position_embeddings - 1]``. + **lengths**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Length of each sentence that can be used to avoid performing attention on padding token indices. + You can also use `attention_mask` for the same result (see above), kept here for compatbility. + Indices selected in ``[0, ..., input_ids.size(-1)]``: + **cache**: + dictionary with ``torch.FloatTensor`` that contains pre-computed + hidden-states (key and values in the attention blocks) as computed by the model + (see `cache` output below). Can be used to speed up sequential decoding. + The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare XLM Model transformer outputting raw hidden-states without any specific head on top.", + XLM_START_DOCSTRING, XLM_INPUTS_DOCSTRING) +class XLMModel(XLMPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the last layer of the model. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
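+
+    For multilingual checkpoints, language ids and explicit lengths can also be supplied.
+    A sketch complementing the basic example below (it assumes a multilingual checkpoint
+    such as 'xlm-mlm-xnli15-1024', whose config exposes `lang2id` and `pad_index`)::
+
+        langs = torch.full_like(input_ids, model.config.lang2id['en'])
+        lengths = (input_ids != model.config.pad_index).sum(dim=1).long()
+        outputs = model(input_ids, langs=langs, lengths=lengths)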
+ + Examples:: + + tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048') + model = XLMModel.from_pretrained('xlm-mlm-en-2048') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + ATTRIBUTES = ['encoder', 'eos_index', 'pad_index', # 'with_output', + 'n_langs', 'use_lang_emb', 'n_words', 'dim', 'n_layers', 'n_heads', + 'hidden_dim', 'dropout', 'attention_dropout', 'asm', + 'asm_cutoffs', 'asm_div_value'] + + def __init__(self, config): #, dico, is_encoder, with_output): + super(XLMModel, self).__init__(config) + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + # encoder / decoder, output layer + self.is_encoder = config.is_encoder + self.is_decoder = not config.is_encoder + if self.is_decoder: + raise NotImplementedError("Currently XLM can only be used as an encoder") + # self.with_output = with_output + self.causal = config.causal + + # dictionary / languages + self.n_langs = config.n_langs + self.use_lang_emb = config.use_lang_emb + self.n_words = config.n_words + self.eos_index = config.eos_index + self.pad_index = config.pad_index + # self.dico = dico + # self.id2lang = config.id2lang + # self.lang2id = config.lang2id + # assert len(self.dico) == self.n_words + # assert len(self.id2lang) == len(self.lang2id) == self.n_langs + + # model parameters + self.dim = config.emb_dim # 512 by default + self.hidden_dim = self.dim * 4 # 2048 by default + self.n_heads = config.n_heads # 8 by default + self.n_layers = config.n_layers + self.dropout = config.dropout + self.attention_dropout = config.attention_dropout + assert self.dim % self.n_heads == 0, 'transformer dim must be a multiple of n_heads' + + # embeddings + self.position_embeddings = nn.Embedding(config.max_position_embeddings, self.dim) + if config.sinusoidal_embeddings: + create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight) + if config.n_langs > 1 and config.use_lang_emb: + self.lang_embeddings = nn.Embedding(self.n_langs, self.dim) + self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index) + self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps) + + # transformer layers + self.attentions = nn.ModuleList() + self.layer_norm1 = nn.ModuleList() + self.ffns = nn.ModuleList() + self.layer_norm2 = nn.ModuleList() + # if self.is_decoder: + # self.layer_norm15 = nn.ModuleList() + # self.encoder_attn = nn.ModuleList() + + for _ in range(self.n_layers): + self.attentions.append(MultiHeadAttention(self.n_heads, self.dim, config=config)) + self.layer_norm1.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps)) + # if self.is_decoder: + # self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps)) + # self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout)) + self.ffns.append(TransformerFFN(self.dim, self.hidden_dim, self.dim, config=config)) + self.layer_norm2.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps)) + + if hasattr(config, "pruned_heads"): + pruned_heads = config.pruned_heads.copy().items() + config.pruned_heads = {} + for layer, heads in pruned_heads: + if self.attentions[int(layer)].n_heads == config.n_heads: + self.prune_heads({int(layer): list(map(int, heads))}) + + self.init_weights() + + def 
_resize_token_embeddings(self, new_num_tokens): + self.embeddings = self._get_resized_embeddings(self.embeddings, new_num_tokens) + return self.embeddings + + def _prune_heads(self, heads_to_prune): + """ Prunes heads of the model. + heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + See base class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.attentions[layer].prune_heads(heads) + + def forward(self, input_ids, attention_mask=None, langs=None, token_type_ids=None, position_ids=None, + lengths=None, cache=None, head_mask=None): # removed: src_enc=None, src_len=None + if lengths is None: + lengths = (input_ids != self.pad_index).sum(dim=1).long() + # mask = input_ids != self.pad_index + + # check inputs + bs, slen = input_ids.size() + assert lengths.size(0) == bs + assert lengths.max().item() <= slen + # input_ids = input_ids.transpose(0, 1) # batch size as dimension 0 + # assert (src_enc is None) == (src_len is None) + # if src_enc is not None: + # assert self.is_decoder + # assert src_enc.size(0) == bs + + # generate masks + mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask) + # if self.is_decoder and src_enc is not None: + # src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None] + + # position_ids + if position_ids is None: + position_ids = input_ids.new((slen,)).long() + position_ids = torch.arange(slen, out=position_ids).unsqueeze(0) + else: + assert position_ids.size() == (bs, slen) # (slen, bs) + # position_ids = position_ids.transpose(0, 1) + + # langs + if langs is not None: + assert langs.size() == (bs, slen) # (slen, bs) + # langs = langs.transpose(0, 1) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1) + head_mask = head_mask.expand(self.n_layers, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.n_layers + + # do not recompute cached elements + if cache is not None: + _slen = slen - cache['slen'] + input_ids = input_ids[:, -_slen:] + position_ids = position_ids[:, -_slen:] + if langs is not None: + langs = langs[:, -_slen:] + mask = mask[:, -_slen:] + attn_mask = attn_mask[:, -_slen:] + + # embeddings + tensor = self.embeddings(input_ids) + tensor = tensor + self.position_embeddings(position_ids).expand_as(tensor) + if langs is not None and self.use_lang_emb: + tensor = tensor + self.lang_embeddings(langs) + if token_type_ids is not None: + tensor = tensor + self.embeddings(token_type_ids) + tensor = self.layer_norm_emb(tensor) + tensor = F.dropout(tensor, p=self.dropout, training=self.training) + tensor *= mask.unsqueeze(-1).to(tensor.dtype) + + # transformer layers + hidden_states = () + attentions = () + for i in range(self.n_layers): + if self.output_hidden_states: + hidden_states = hidden_states + (tensor,) + + # self attention + attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i]) + 
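+            # attn_outputs is (context,) or (context, attention_weights) depending on
+            # config.output_attentions; context has shape (bs, slen, dim)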
attn = attn_outputs[0] + if self.output_attentions: + attentions = attentions + (attn_outputs[1],) + attn = F.dropout(attn, p=self.dropout, training=self.training) + tensor = tensor + attn + tensor = self.layer_norm1[i](tensor) + + # encoder attention (for decoder only) + # if self.is_decoder and src_enc is not None: + # attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache) + # attn = F.dropout(attn, p=self.dropout, training=self.training) + # tensor = tensor + attn + # tensor = self.layer_norm15[i](tensor) + + # FFN + tensor = tensor + self.ffns[i](tensor) + tensor = self.layer_norm2[i](tensor) + tensor *= mask.unsqueeze(-1).to(tensor.dtype) + + # Add last hidden state + if self.output_hidden_states: + hidden_states = hidden_states + (tensor,) + + # update cache length + if cache is not None: + cache['slen'] += tensor.size(1) + + # move back sequence length to dimension 0 + # tensor = tensor.transpose(0, 1) + + outputs = (tensor,) + if self.output_hidden_states: + outputs = outputs + (hidden_states,) + if self.output_attentions: + outputs = outputs + (attentions,) + return outputs # outputs, (hidden_states), (attentions) + + +class XLMPredLayer(nn.Module): + """ + Prediction layer (cross_entropy or adaptive_softmax). + """ + def __init__(self, config): + super(XLMPredLayer, self).__init__() + self.asm = config.asm + self.n_words = config.n_words + self.pad_index = config.pad_index + dim = config.emb_dim + + if config.asm is False: + self.proj = nn.Linear(dim, config.n_words, bias=True) + else: + self.proj = nn.AdaptiveLogSoftmaxWithLoss( + in_features=dim, + n_classes=config.n_words, + cutoffs=config.asm_cutoffs, + div_value=config.asm_div_value, + head_bias=True, # default is False + ) + + def forward(self, x, y=None): + """ Compute the loss, and optionally the scores. + """ + outputs = () + if self.asm is False: + scores = self.proj(x).view(-1, self.n_words) + outputs = (scores,) + outputs + if y is not None: + loss = F.cross_entropy(scores, y, reduction='elementwise_mean') + outputs = (loss,) + outputs + else: + scores = self.proj.log_prob(x) + outputs = (scores,) + outputs + if y is not None: + _, loss = self.proj(x, y) + outputs = (loss,) + outputs + + return outputs + + +@add_start_docstrings("""The XLM Model transformer with a language modeling head on top + (linear layer with weights tied to the input embeddings). """, + XLM_START_DOCSTRING, XLM_INPUTS_DOCSTRING) +class XLMWithLMHeadModel(XLMPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048') + model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(XLMWithLMHeadModel, self).__init__(config) + self.transformer = XLMModel(config) + self.pred_layer = XLMPredLayer(config) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the embeddings + """ + self._tie_or_clone_weights(self.pred_layer.proj, self.transformer.embeddings) + + def forward(self, input_ids, attention_mask=None, langs=None, token_type_ids=None, position_ids=None, + lengths=None, cache=None, head_mask=None, labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + langs=langs, + token_type_ids=token_type_ids, + position_ids=position_ids, + lengths=lengths, + cache=cache, + head_mask=head_mask) + + output = transformer_outputs[0] + outputs = self.pred_layer(output, labels) + outputs = outputs + transformer_outputs[1:] # Keep new_mems and attention/hidden states if they are here + + return outputs + + +@add_start_docstrings("""XLM Model with a sequence classification/regression head on top (a linear layer on top of + the pooled output) e.g. for GLUE tasks. """, + XLM_START_DOCSTRING, XLM_INPUTS_DOCSTRING) +class XLMForSequenceClassification(XLMPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. + If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
+ **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048') + model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(XLMForSequenceClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.transformer = XLMModel(config) + self.sequence_summary = SequenceSummary(config) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, langs=None, token_type_ids=None, position_ids=None, + lengths=None, cache=None, head_mask=None, labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + langs=langs, + token_type_ids=token_type_ids, + position_ids=position_ids, + lengths=lengths, + cache=cache, + head_mask=head_mask) + + output = transformer_outputs[0] + logits = self.sequence_summary(output) + + outputs = (logits,) + transformer_outputs[1:] # Keep new_mems and attention/hidden states if they are here + + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs + + +@add_start_docstrings("""XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of + the hidden-states output to compute `span start logits` and `span end logits`). """, + XLM_START_DOCSTRING, XLM_INPUTS_DOCSTRING) +class XLMForQuestionAnswering(XLMPreTrainedModel): + r""" + **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **is_impossible**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels whether a question has an answer or no answer (SQuAD 2.0) + **cls_index**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the classification token to use as input for computing plausibility of the answer. + **p_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...) 
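+            For example (an illustrative sketch; the positions are placeholders), masking the
+            two leading special/question tokens and a trailing pad token of a length-6 input::
+
+                p_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 1.0]])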
+ + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. + **start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-start scores (before SoftMax). + **end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)`` + Span-end scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048') + model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + start_positions = torch.tensor([1]) + end_positions = torch.tensor([3]) + outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions) + loss, start_scores, end_scores = outputs[:2] + + """ + def __init__(self, config): + super(XLMForQuestionAnswering, self).__init__(config) + + self.transformer = XLMModel(config) + self.qa_outputs = SQuADHead(config) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, langs=None, token_type_ids=None, position_ids=None, + lengths=None, cache=None, head_mask=None, start_positions=None, end_positions=None, + is_impossible=None, cls_index=None, p_mask=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + langs=langs, + token_type_ids=token_type_ids, + position_ids=position_ids, + lengths=lengths, + cache=cache, + head_mask=head_mask) + + output = transformer_outputs[0] + + outputs = self.qa_outputs(output, start_positions=start_positions, end_positions=end_positions, + cls_index=cls_index, is_impossible=is_impossible, p_mask=p_mask) + + outputs = outputs + transformer_outputs[1:] # Keep new_mems and attention/hidden states if they are here + + return outputs diff --git a/Optimus/code/pytorch_transformers/modeling_xlnet.py b/Optimus/code/pytorch_transformers/modeling_xlnet.py new file mode 100755 index 0000000000000000000000000000000000000000..a4a300e0706d3b2a1f12ea8c9acdc058804c394a --- /dev/null +++ b/Optimus/code/pytorch_transformers/modeling_xlnet.py @@ -0,0 +1,1248 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch XLNet model. +""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +import math +import os +import sys +from io import open + +import torch +from torch import nn +from torch.nn import functional as F +from torch.nn import CrossEntropyLoss, MSELoss + +from .modeling_utils import PreTrainedModel, prune_linear_layer, SequenceSummary, PoolerAnswerClass, PoolerEndLogits, PoolerStartLogits +from .configuration_xlnet import XLNetConfig +from .file_utils import add_start_docstrings + + +logger = logging.getLogger(__name__) + +XLNET_PRETRAINED_MODEL_ARCHIVE_MAP = { + 'xlnet-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-pytorch_model.bin", + 'xlnet-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-pytorch_model.bin", +} + + +def build_tf_xlnet_to_pytorch_map(model, config, tf_weights=None): + """ A map of modules from TF to PyTorch. + I use a map to keep the PyTorch model as + identical to the original PyTorch model as possible. + """ + + tf_to_pt_map = {} + + if hasattr(model, 'transformer'): + if hasattr(model, 'lm_loss'): + # We will load also the output bias + tf_to_pt_map['model/lm_loss/bias'] = model.lm_loss.bias + if hasattr(model, 'sequence_summary') and 'model/sequnece_summary/summary/kernel' in tf_weights: + # We will load also the sequence summary + tf_to_pt_map['model/sequnece_summary/summary/kernel'] = model.sequence_summary.summary.weight + tf_to_pt_map['model/sequnece_summary/summary/bias'] = model.sequence_summary.summary.bias + if hasattr(model, 'logits_proj') and config.finetuning_task is not None \ + and 'model/regression_{}/logit/kernel'.format(config.finetuning_task) in tf_weights: + tf_to_pt_map['model/regression_{}/logit/kernel'.format(config.finetuning_task)] = model.logits_proj.weight + tf_to_pt_map['model/regression_{}/logit/bias'.format(config.finetuning_task)] = model.logits_proj.bias + + # Now load the rest of the transformer + model = model.transformer + + # Embeddings and output + tf_to_pt_map.update({'model/transformer/word_embedding/lookup_table': model.word_embedding.weight, + 'model/transformer/mask_emb/mask_emb': model.mask_emb}) + + # Transformer blocks + for i, b in enumerate(model.layer): + layer_str = "model/transformer/layer_%d/" % i + tf_to_pt_map.update({ + layer_str + "rel_attn/LayerNorm/gamma": b.rel_attn.layer_norm.weight, + layer_str + "rel_attn/LayerNorm/beta": b.rel_attn.layer_norm.bias, + layer_str + "rel_attn/o/kernel": b.rel_attn.o, + layer_str + "rel_attn/q/kernel": b.rel_attn.q, + layer_str + "rel_attn/k/kernel": b.rel_attn.k, + layer_str + "rel_attn/r/kernel": b.rel_attn.r, + layer_str + "rel_attn/v/kernel": b.rel_attn.v, + layer_str + "ff/LayerNorm/gamma": b.ff.layer_norm.weight, + layer_str + "ff/LayerNorm/beta": b.ff.layer_norm.bias, + layer_str + "ff/layer_1/kernel": b.ff.layer_1.weight, + layer_str + "ff/layer_1/bias": b.ff.layer_1.bias, + layer_str + "ff/layer_2/kernel": b.ff.layer_2.weight, + layer_str + "ff/layer_2/bias": b.ff.layer_2.bias, + }) + + # Relative positioning biases + 
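+    # with config.untie_r every layer owns its own r_r/r_w/r_s biases and segment embeddings,
+    # so the TF variables map to per-layer lists; otherwise a single shared copy lives on the
+    # transformer and each list below holds one element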
if config.untie_r: + r_r_list = [] + r_w_list = [] + r_s_list = [] + seg_embed_list = [] + for b in model.layer: + r_r_list.append(b.rel_attn.r_r_bias) + r_w_list.append(b.rel_attn.r_w_bias) + r_s_list.append(b.rel_attn.r_s_bias) + seg_embed_list.append(b.rel_attn.seg_embed) + else: + r_r_list = [model.r_r_bias] + r_w_list = [model.r_w_bias] + r_s_list = [model.r_s_bias] + seg_embed_list = [model.seg_embed] + tf_to_pt_map.update({ + 'model/transformer/r_r_bias': r_r_list, + 'model/transformer/r_w_bias': r_w_list, + 'model/transformer/r_s_bias': r_s_list, + 'model/transformer/seg_embed': seg_embed_list}) + return tf_to_pt_map + +def load_tf_weights_in_xlnet(model, config, tf_path): + """ Load tf checkpoints in a pytorch model + """ + try: + import numpy as np + import tensorflow as tf + except ImportError: + logger.error("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see " + "https://www.tensorflow.org/install/ for installation instructions.") + raise + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + tf_weights = {} + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + tf_weights[name] = array + + # Build TF to PyTorch weights loading map + tf_to_pt_map = build_tf_xlnet_to_pytorch_map(model, config, tf_weights) + + for name, pointer in tf_to_pt_map.items(): + logger.info("Importing {}".format(name)) + if name not in tf_weights: + logger.info("{} not in tf pre-trained weights, skipping".format(name)) + continue + array = tf_weights[name] + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v + # which are not required for using pretrained model + if 'kernel' in name and ('ff' in name or 'summary' in name or 'logit' in name): + logger.info("Transposing") + array = np.transpose(array) + if isinstance(pointer, list): + # Here we will split the TF weigths + assert len(pointer) == array.shape[0] + for i, p_i in enumerate(pointer): + arr_i = array[i, ...] + try: + assert p_i.shape == arr_i.shape + except AssertionError as e: + e.args += (p_i.shape, arr_i.shape) + raise + logger.info("Initialize PyTorch weight {} for layer {}".format(name, i)) + p_i.data = torch.from_numpy(arr_i) + else: + try: + assert pointer.shape == array.shape + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + pointer.data = torch.from_numpy(array) + tf_weights.pop(name, None) + tf_weights.pop(name + '/Adam', None) + tf_weights.pop(name + '/Adam_1', None) + + logger.info("Weights not copied to PyTorch model: {}".format(', '.join(tf_weights.keys()))) + return model + + +def gelu(x): + """ Implementation of the gelu activation function. 
+ XLNet is using OpenAI GPT's gelu (not exactly the same as BERT) + Also see https://arxiv.org/abs/1606.08415 + """ + cdf = 0.5 * (1.0 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3)))) + return x * cdf + + +def swish(x): + return x * torch.sigmoid(x) + + +ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish} + + +try: + from apex.normalization.fused_layer_norm import FusedLayerNorm as XLNetLayerNorm +except (ImportError, AttributeError) as e: + logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .") + from torch.nn import LayerNorm as XLNetLayerNorm + +class XLNetRelativeAttention(nn.Module): + def __init__(self, config): + super(XLNetRelativeAttention, self).__init__() + self.output_attentions = config.output_attentions + + if config.d_model % config.n_head != 0: + raise ValueError( + "The hidden size (%d) is not a multiple of the number of attention " + "heads (%d)" % (config.d_model, config.n_head)) + + self.n_head = config.n_head + self.d_head = config.d_head + self.d_model = config.d_model + self.scale = 1 / (config.d_head ** 0.5) + + self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head)) + self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head)) + self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head)) + self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head)) + self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head)) + + self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head)) + self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head)) + + self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.dropout) + + def prune_heads(self, heads): + raise NotImplementedError + + @staticmethod + def rel_shift(x, klen=-1): + """perform relative shift to form the relative attention score.""" + x_size = x.shape + + x = x.reshape(x_size[1], x_size[0], x_size[2], x_size[3]) + x = x[1:, ...] 
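+        # the swap-reshape above and the reshape back below implement the Transformer-XL
+        # "relative shift": dropping the first row offsets each query row by one extra
+        # position, turning scores indexed by absolute key position into relative distances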
+ x = x.reshape(x_size[0], x_size[1] - 1, x_size[2], x_size[3]) + # x = x[:, 0:klen, :, :] + x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long)) + + return x + + def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None): + """Core relative positional attention operations.""" + + # content based attention score + ac = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_w_bias, k_head_h) + + # position based attention score + bd = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_r_bias, k_head_r) + bd = self.rel_shift(bd, klen=ac.shape[1]) + + # segment based attention score + if seg_mat is None: + ef = 0 + else: + ef = torch.einsum('ibnd,snd->ibns', q_head + self.r_s_bias, self.seg_embed) + ef = torch.einsum('ijbs,ibns->ijbn', seg_mat, ef) + + # merge attention scores and perform masking + attn_score = (ac + bd + ef) * self.scale + if attn_mask is not None: + # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask + if attn_mask.dtype == torch.float16: + attn_score = attn_score - 65500 * attn_mask + else: + attn_score = attn_score - 1e30 * attn_mask + + # attention probability + attn_prob = F.softmax(attn_score, dim=1) + attn_prob = self.dropout(attn_prob) + + # Mask heads if we want to + if head_mask is not None: + attn_prob = attn_prob * head_mask + + # attention output + attn_vec = torch.einsum('ijbn,jbnd->ibnd', attn_prob, v_head_h) + + if self.output_attentions: + return attn_vec, attn_prob + + return attn_vec + + def post_attention(self, h, attn_vec, residual=True): + """Post-attention processing.""" + # post-attention projection (back to `d_model`) + attn_out = torch.einsum('ibnd,hnd->ibh', attn_vec, self.o) + + attn_out = self.dropout(attn_out) + if residual: + attn_out = attn_out + h + output = self.layer_norm(attn_out) + + return output + + def forward(self, h, g, + attn_mask_h, attn_mask_g, + r, seg_mat, + mems=None, target_mapping=None, head_mask=None): + if g is not None: + ###### Two-stream attention with relative positional encoding. 
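+            # h (content stream) sees the actual token content; g (query stream) only sees
+            # position information for the tokens being predicted, and both streams attend
+            # over the same key/value projections built from the content stream below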
+ # content based attention score + if mems is not None and mems.dim() > 1: + cat = torch.cat([mems, h], dim=0) + else: + cat = h + + # content-based key head + k_head_h = torch.einsum('ibh,hnd->ibnd', cat, self.k) + + # content-based value head + v_head_h = torch.einsum('ibh,hnd->ibnd', cat, self.v) + + # position-based key head + k_head_r = torch.einsum('ibh,hnd->ibnd', r, self.r) + + ##### h-stream + # content-stream query head + q_head_h = torch.einsum('ibh,hnd->ibnd', h, self.q) + + # core attention ops + attn_vec_h = self.rel_attn_core( + q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask) + + if self.output_attentions: + attn_vec_h, attn_prob_h = attn_vec_h + + # post processing + output_h = self.post_attention(h, attn_vec_h) + + ##### g-stream + # query-stream query head + q_head_g = torch.einsum('ibh,hnd->ibnd', g, self.q) + + # core attention ops + if target_mapping is not None: + q_head_g = torch.einsum('mbnd,mlb->lbnd', q_head_g, target_mapping) + attn_vec_g = self.rel_attn_core( + q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask) + + if self.output_attentions: + attn_vec_g, attn_prob_g = attn_vec_g + + attn_vec_g = torch.einsum('lbnd,mlb->mbnd', attn_vec_g, target_mapping) + else: + attn_vec_g = self.rel_attn_core( + q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask) + + if self.output_attentions: + attn_vec_g, attn_prob_g = attn_vec_g + + # post processing + output_g = self.post_attention(g, attn_vec_g) + + if self.output_attentions: + attn_prob = attn_prob_h, attn_prob_g + + else: + ###### Multi-head attention with relative positional encoding + if mems is not None and mems.dim() > 1: + cat = torch.cat([mems, h], dim=0) + else: + cat = h + + # content heads + q_head_h = torch.einsum('ibh,hnd->ibnd', h, self.q) + k_head_h = torch.einsum('ibh,hnd->ibnd', cat, self.k) + v_head_h = torch.einsum('ibh,hnd->ibnd', cat, self.v) + + # positional heads + k_head_r = torch.einsum('ibh,hnd->ibnd', r, self.r) + + # core attention ops + attn_vec = self.rel_attn_core( + q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask) + + if self.output_attentions: + attn_vec, attn_prob = attn_vec + + # post processing + output_h = self.post_attention(h, attn_vec) + output_g = None + + outputs = (output_h, output_g) + if self.output_attentions: + outputs = outputs + (attn_prob,) + return outputs + +class XLNetFeedForward(nn.Module): + def __init__(self, config): + super(XLNetFeedForward, self).__init__() + self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps) + self.layer_1 = nn.Linear(config.d_model, config.d_inner) + self.layer_2 = nn.Linear(config.d_inner, config.d_model) + self.dropout = nn.Dropout(config.dropout) + if isinstance(config.ff_activation, str) or \ + (sys.version_info[0] == 2 and isinstance(config.ff_activation, unicode)): + self.activation_function = ACT2FN[config.ff_activation] + else: + self.activation_function = config.ff_activation + + def forward(self, inp): + output = inp + output = self.layer_1(output) + output = self.activation_function(output) + output = self.dropout(output) + output = self.layer_2(output) + output = self.dropout(output) + output = self.layer_norm(output + inp) + return output + +class XLNetLayer(nn.Module): + def __init__(self, config): + super(XLNetLayer, self).__init__() + self.rel_attn = XLNetRelativeAttention(config) + self.ff = 
XLNetFeedForward(config) + self.dropout = nn.Dropout(config.dropout) + + def forward(self, output_h, output_g, + attn_mask_h, attn_mask_g, + r, seg_mat, mems=None, target_mapping=None, head_mask=None): + outputs = self.rel_attn(output_h, output_g, attn_mask_h, attn_mask_g, + r, seg_mat, mems=mems, target_mapping=target_mapping, + head_mask=head_mask) + output_h, output_g = outputs[:2] + + if output_g is not None: + output_g = self.ff(output_g) + output_h = self.ff(output_h) + + outputs = (output_h, output_g) + outputs[2:] # Add again attentions if there are there + return outputs + + +class XLNetPreTrainedModel(PreTrainedModel): + """ An abstract class to handle weights initialization and + a simple interface for dowloading and loading pretrained models. + """ + config_class = XLNetConfig + pretrained_model_archive_map = XLNET_PRETRAINED_MODEL_ARCHIVE_MAP + load_tf_weights = load_tf_weights_in_xlnet + base_model_prefix = "transformer" + + def _init_weights(self, module): + """ Initialize the weights. + """ + if isinstance(module, (nn.Linear, nn.Embedding)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, XLNetLayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + elif isinstance(module, XLNetRelativeAttention): + for param in [module.q, module.k, module.v, module.o, module.r, + module.r_r_bias, module.r_s_bias, module.r_w_bias, + module.seg_embed]: + param.data.normal_(mean=0.0, std=self.config.initializer_range) + elif isinstance(module, XLNetModel): + module.mask_emb.data.normal_(mean=0.0, std=self.config.initializer_range) + + +XLNET_START_DOCSTRING = r""" The XLNet model was proposed in + `XLNet: Generalized Autoregressive Pretraining for Language Understanding`_ + by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. + XLnet is an extension of the Transformer-XL model pre-trained using an autoregressive method + to learn bidirectional contexts by maximizing the expected likelihood over all permutations + of the input sequence factorization order. + + The specific attention pattern can be controlled at training and test time using the `perm_mask` input. + + Do to the difficulty of training a fully auto-regressive model over various factorization order, + XLNet is pretrained using only a sub-set of the output tokens as target which are selected + with the `target_mapping` input. + + To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and + `target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`) + + This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and + refer to the PyTorch documentation for all matter related to general usage and behavior. + + .. _`XLNet: Generalized Autoregressive Pretraining for Language Understanding`: + http://arxiv.org/abs/1906.08237 + + .. _`torch.nn.Module`: + https://pytorch.org/docs/stable/nn.html#module + + Parameters: + config (:class:`~pytorch_transformers.XLNetConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the configuration. 
+ Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights. +""" + +XLNET_INPUTS_DOCSTRING = r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + XLNet is a model with relative position embeddings so you can either pad the inputs on + the right or on the left. + Indices can be obtained using :class:`pytorch_transformers.XLNetTokenizer`. + See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and + :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens (can be used to indicate various portions of the inputs). + The type indices in XLNet are NOT selected in the vocabulary, they can be arbitrary numbers and + the important thing is that they should be different for tokens which belong to different segments. + The model will compute relative segment differences from the given type indices: + 0 if the segment id of two tokens are the same, 1 if not. + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **mems**: (`optional`) + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as output by the model + (see `mems` output below). Can be used to speed up sequential decoding and attend to longer context. + To activate mems you need to set up config.mem_len to a positive value which will be the max number of tokens in + the memory output by the model. E.g. `model = XLNetModel.from_pretrained('xlnet-base-case, mem_len=1024)` will + instantiate a model which can use up to 1024 tokens of memory (in addition to the input it self). + **perm_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, sequence_length)``: + Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``: + If ``perm_mask[k, i, j] = 0``, i attend to j in batch k; + if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k. + If None, each token attends to all the others (full bidirectional attention). + Only used during pretraining (to define factorization order) or for sequential decoding (generation). + **target_mapping**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_predict, sequence_length)``: + Mask to indicate the output tokens to use. + If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token. + Only used during pretraining for partial prediction or for sequential decoding (generation). + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + A parallel sequence of tokens (can be used to indicate various portions of the inputs). + The embeddings from these tokens will be summed with the respective token embeddings. + Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices). + **input_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Mask to avoid performing attention on padding token indices. + Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding. 
+ Kept for compatibility with the original code base. + You can only uses one of `input_mask` and `attention_mask` + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. +""" + +@add_start_docstrings("The bare XLNet Model transformer outputting raw hidden-states without any specific head on top.", + XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING) +class XLNetModel(XLNetPreTrainedModel): + r""" + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)`` + Sequence of hidden-states at the last layer of the model. + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context. + See details in the docstring of the `mems` input above. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased') + model = XLNetModel.from_pretrained('xlnet-large-cased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + + """ + def __init__(self, config): + super(XLNetModel, self).__init__(config) + self.output_attentions = config.output_attentions + self.output_hidden_states = config.output_hidden_states + + self.mem_len = config.mem_len + self.reuse_len = config.reuse_len + self.d_model = config.d_model + self.same_length = config.same_length + self.attn_type = config.attn_type + self.bi_data = config.bi_data + self.clamp_len = config.clamp_len + self.n_layer = config.n_layer + + self.word_embedding = nn.Embedding(config.n_token, config.d_model) + self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model)) + self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)]) + self.dropout = nn.Dropout(config.dropout) + + self.init_weights() + + def _resize_token_embeddings(self, new_num_tokens): + self.word_embedding = self._get_resized_embeddings(self.word_embedding, new_num_tokens) + return self.word_embedding + + def _prune_heads(self, heads_to_prune): + raise NotImplementedError + + def create_mask(self, qlen, mlen): + """ + Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked. 
+ + Args: + qlen: TODO Lysandre didn't fill + mlen: TODO Lysandre didn't fill + + :: + + same_length=False: same_length=True: + < qlen > < qlen > + ^ [0 0 0 0 0 1 1 1 1] [0 0 0 0 0 1 1 1 1] + [0 0 0 0 0 0 1 1 1] [1 0 0 0 0 0 1 1 1] + qlen [0 0 0 0 0 0 0 1 1] [1 1 0 0 0 0 0 1 1] + [0 0 0 0 0 0 0 0 1] [1 1 1 0 0 0 0 0 1] + v [0 0 0 0 0 0 0 0 0] [1 1 1 1 0 0 0 0 0] + + """ + attn_mask = torch.ones([qlen, qlen]) + mask_up = torch.triu(attn_mask, diagonal=1) + attn_mask_pad = torch.zeros([qlen, mlen]) + ret = torch.cat([attn_mask_pad, mask_up], dim=1) + if self.same_length: + mask_lo = torch.tril(attn_mask, diagonal=-1) + ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1) + + ret = ret.to(next(self.parameters())) + return ret + + def cache_mem(self, curr_out, prev_mem): + """cache hidden states into memory.""" + if self.mem_len is None or self.mem_len == 0: + return None + else: + if self.reuse_len is not None and self.reuse_len > 0: + curr_out = curr_out[:self.reuse_len] + + if prev_mem is None: + new_mem = curr_out[-self.mem_len:] + else: + new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len:] + + return new_mem.detach() + + @staticmethod + def positional_embedding(pos_seq, inv_freq, bsz=None): + sinusoid_inp = torch.einsum('i,d->id', pos_seq, inv_freq) + pos_emb = torch.cat([torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)], dim=-1) + pos_emb = pos_emb[:, None, :] + + if bsz is not None: + pos_emb = pos_emb.expand(-1, bsz, -1) + + return pos_emb + + def relative_positional_encoding(self, qlen, klen, bsz=None): + """create relative positional encoding.""" + freq_seq = torch.arange(0, self.d_model, 2.0, dtype=torch.float) + inv_freq = 1 / torch.pow(10000, (freq_seq / self.d_model)) + + if self.attn_type == 'bi': + # beg, end = klen - 1, -qlen + beg, end = klen, -qlen + elif self.attn_type == 'uni': + # beg, end = klen - 1, -1 + beg, end = klen, -1 + else: + raise ValueError('Unknown `attn_type` {}.'.format(self.attn_type)) + + if self.bi_data: + fwd_pos_seq = torch.arange(beg, end, -1.0, dtype=torch.float) + bwd_pos_seq = torch.arange(-beg, -end, 1.0, dtype=torch.float) + + if self.clamp_len > 0: + fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len) + bwd_pos_seq = bwd_pos_seq.clamp(-self.clamp_len, self.clamp_len) + + if bsz is not None: + fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz//2) + bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz//2) + else: + fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq) + bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq) + + pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1) + else: + fwd_pos_seq = torch.arange(beg, end, -1.0) + if self.clamp_len > 0: + fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len) + pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz) + + pos_emb = pos_emb.to(next(self.parameters())) + return pos_emb + + def forward(self, input_ids, attention_mask=None, mems=None, perm_mask=None, target_mapping=None, + token_type_ids=None, input_mask=None, head_mask=None): + # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end + # but we want a unified interface in the library with the batch size on the first dimension + # so we move here the first dimension (batch) to the end + input_ids = input_ids.transpose(0, 1).contiguous() + token_type_ids = token_type_ids.transpose(0, 1).contiguous() if token_type_ids is not None else None + input_mask = input_mask.transpose(0, 1).contiguous() if 
input_mask is not None else None + attention_mask = attention_mask.transpose(0, 1).contiguous() if attention_mask is not None else None + perm_mask = perm_mask.permute(1, 2, 0).contiguous() if perm_mask is not None else None + target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None + + qlen, bsz = input_ids.shape[0], input_ids.shape[1] + mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0 + klen = mlen + qlen + + dtype_float = next(self.parameters()).dtype + device = next(self.parameters()).device + + ##### Attention mask + # causal attention mask + if self.attn_type == 'uni': + attn_mask = self.create_mask(qlen, mlen) + attn_mask = attn_mask[:, :, None, None] + elif self.attn_type == 'bi': + attn_mask = None + else: + raise ValueError('Unsupported attention type: {}'.format(self.attn_type)) + + # data mask: input mask & perm mask + assert input_mask is None or attention_mask is None, "You can only use one of input_mask (uses 1 for padding) " + "or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one." + if input_mask is None and attention_mask is not None: + input_mask = 1.0 - attention_mask + if input_mask is not None and perm_mask is not None: + data_mask = input_mask[None] + perm_mask + elif input_mask is not None and perm_mask is None: + data_mask = input_mask[None] + elif input_mask is None and perm_mask is not None: + data_mask = perm_mask + else: + data_mask = None + + if data_mask is not None: + # all mems can be attended to + if mlen > 0: + mems_mask = torch.zeros([data_mask.shape[0], mlen, bsz]).to(data_mask) + data_mask = torch.cat([mems_mask, data_mask], dim=1) + if attn_mask is None: + attn_mask = data_mask[:, :, :, None] + else: + attn_mask += data_mask[:, :, :, None] + + if attn_mask is not None: + attn_mask = (attn_mask > 0).to(dtype_float) + + if attn_mask is not None: + non_tgt_mask = -torch.eye(qlen).to(attn_mask) + if mlen > 0: + non_tgt_mask = torch.cat([torch.zeros([qlen, mlen]).to(attn_mask), non_tgt_mask], dim=-1) + non_tgt_mask = ((attn_mask + non_tgt_mask[:, :, None, None]) > 0).to(attn_mask) + else: + non_tgt_mask = None + + ##### Word embeddings and prepare h & g hidden states + word_emb_k = self.word_embedding(input_ids) + output_h = self.dropout(word_emb_k) + if target_mapping is not None: + word_emb_q = self.mask_emb.expand(target_mapping.shape[0], bsz, -1) + # else: # We removed the inp_q input which was same as target mapping + # inp_q_ext = inp_q[:, :, None] + # word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k + output_g = self.dropout(word_emb_q) + else: + output_g = None + + ##### Segment embedding + if token_type_ids is not None: + # Convert `token_type_ids` to one-hot `seg_mat` + if mlen > 0: + mem_pad = torch.zeros([mlen, bsz], dtype=torch.long, device=device) + cat_ids = torch.cat([mem_pad, token_type_ids], dim=0) + else: + cat_ids = token_type_ids + + # `1` indicates not in the same segment [qlen x klen x bsz] + seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long() + seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float) + else: + seg_mat = None + + ##### Positional encoding + pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz) + pos_emb = self.dropout(pos_emb) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer) + 
# and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head] + if head_mask is not None: + if head_mask.dim() == 1: + head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0) + head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1) + elif head_mask.dim() == 2: + head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1) + head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility + else: + head_mask = [None] * self.n_layer + + new_mems = () + if mems is None: + mems = [None] * len(self.layer) + + attentions = [] + hidden_states = [] + for i, layer_module in enumerate(self.layer): + # cache new mems + new_mems = new_mems + (self.cache_mem(output_h, mems[i]),) + if self.output_hidden_states: + hidden_states.append((output_h, output_g) if output_g is not None else output_h) + + outputs = layer_module(output_h, output_g, attn_mask_h=non_tgt_mask, attn_mask_g=attn_mask, + r=pos_emb, seg_mat=seg_mat, mems=mems[i], target_mapping=target_mapping, + head_mask=head_mask[i]) + output_h, output_g = outputs[:2] + if self.output_attentions: + attentions.append(outputs[2]) + + # Add last hidden state + if self.output_hidden_states: + hidden_states.append((output_h, output_g) if output_g is not None else output_h) + + output = self.dropout(output_g if output_g is not None else output_h) + + # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method) + outputs = (output.permute(1, 0, 2).contiguous(), new_mems) + if self.output_hidden_states: + if output_g is not None: + hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs) + else: + hidden_states = tuple(hs.permute(1, 0, 2).contiguous() for hs in hidden_states) + outputs = outputs + (hidden_states,) + if self.output_attentions: + attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions) + outputs = outputs + (attentions,) + + return outputs # outputs, new_mems, (hidden_states), (attentions) + + +@add_start_docstrings("""XLNet Model with a language modeling head on top + (linear layer with weights tied to the input embeddings). """, + XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING) +class XLNetLMHeadModel(XLNetPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: + Labels for language modeling. + Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids`` + Indices are selected in ``[-1, 0, ..., config.vocab_size]`` + All labels set to ``-1`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Language modeling loss. + **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)`` + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context. + See details in the docstring of the `mems` input above. 
+ **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased') + model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased') + # We show how to setup inputs to predict a next token using a bi-directional context. + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very ")).unsqueeze(0) # We will predict the masked token + perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token + target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token) + outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping) + next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size] + + """ + def __init__(self, config): + super(XLNetLMHeadModel, self).__init__(config) + self.attn_type = config.attn_type + self.same_length = config.same_length + + self.transformer = XLNetModel(config) + self.lm_loss = nn.Linear(config.d_model, config.n_token, bias=True) + + self.init_weights() + self.tie_weights() + + def tie_weights(self): + """ Make sure we are sharing the embeddings + """ + self._tie_or_clone_weights(self.lm_loss, self.transformer.word_embedding) + + def forward(self, input_ids, attention_mask=None, mems=None, perm_mask=None, target_mapping=None, + token_type_ids=None, input_mask=None, head_mask=None, labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + mems=mems, + perm_mask=perm_mask, + target_mapping=target_mapping, + token_type_ids=token_type_ids, + input_mask=input_mask, + head_mask=head_mask) + + logits = self.lm_loss(transformer_outputs[0]) + + outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if there are in it + + if labels is not None: + # Flatten the tokens + loss_fct = CrossEntropyLoss(ignore_index=-1) + loss = loss_fct(logits.view(-1, logits.size(-1)), + labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # return (loss), logits, mems, (hidden states), (attentions) + + +@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of + the pooled output) e.g. for GLUE tasks. """, + XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING) +class XLNetForSequenceClassification(XLNetPreTrainedModel): + r""" + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the sequence classification/regression loss. + Indices should be in ``[0, ..., config.num_labels - 1]``. 
+ If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss), + If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification (or regression if config.num_labels==1) loss. + **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` + Classification (or regression if config.num_labels==1) scores (before SoftMax). + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context. + See details in the docstring of the `mems` input above. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + + Examples:: + + tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased') + model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased') + input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 + labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, logits = outputs[:2] + + """ + def __init__(self, config): + super(XLNetForSequenceClassification, self).__init__(config) + self.num_labels = config.num_labels + + self.transformer = XLNetModel(config) + self.sequence_summary = SequenceSummary(config) + self.logits_proj = nn.Linear(config.d_model, config.num_labels) + + self.init_weights() + + def forward(self, input_ids, attention_mask=None, mems=None, perm_mask=None, target_mapping=None, + token_type_ids=None, input_mask=None, head_mask=None, labels=None): + transformer_outputs = self.transformer(input_ids, + attention_mask=attention_mask, + mems=mems, + perm_mask=perm_mask, + target_mapping=target_mapping, + token_type_ids=token_type_ids, + input_mask=input_mask, + head_mask=head_mask) + output = transformer_outputs[0] + + output = self.sequence_summary(output) + logits = self.logits_proj(output) + + outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if there are in it + + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # return (loss), logits, mems, (hidden states), (attentions) + +@add_start_docstrings("""XLNet Model with a multiple choice classification head on top (a linear layer on top of + the pooled output and a softmax) e.g. for RACE/SWAG tasks. 
""", + XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING) +class XLNetForMultipleChoice(XLNetPreTrainedModel): + r""" + Inputs: + **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Indices of input sequence tokens in the vocabulary. + The second dimension of the input (`num_choices`) indicates the number of choices to scores. + **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Segment token indices to indicate first and second portions of the inputs. + The second dimension of the input (`num_choices`) indicates the number of choices to score. + Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` + **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``: + Mask to avoid performing attention on padding token indices. + The second dimension of the input (`num_choices`) indicates the number of choices to score. + Mask values selected in ``[0, 1]``: + ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. + **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: + Mask to nullify selected heads of the self-attention modules. + Mask values selected in ``[0, 1]``: + ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. + **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for computing the multiple choice classification loss. + Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above) + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss. + **classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension + of the input tensors. (see `input_ids` above). + Classification scores (before SoftMax). + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
+ + Examples:: + + tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased') + model = XLNetForMultipleChoice.from_pretrained('xlnet-base-cased') + choices = ["Hello, my dog is cute", "Hello, my cat is amazing"] + input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices + labels = torch.tensor(1).unsqueeze(0) # Batch size 1 + outputs = model(input_ids, labels=labels) + loss, classification_scores = outputs[:2] + + """ + def __init__(self, config): + super(XLNetForMultipleChoice, self).__init__(config) + + self.transformer = XLNetModel(config) + self.sequence_summary = SequenceSummary(config) + self.logits_proj = nn.Linear(config.d_model, 1) + + self.init_weights() + + def forward(self, input_ids, token_type_ids=None, input_mask=None, attention_mask=None, + mems=None, perm_mask=None, target_mapping=None, + labels=None, head_mask=None): + num_choices = input_ids.shape[1] + + flat_input_ids = input_ids.view(-1, input_ids.size(-1)) + flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None + flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None + flat_input_mask = input_mask.view(-1, input_mask.size(-1)) if input_mask is not None else None + + transformer_outputs = self.transformer(flat_input_ids, token_type_ids=flat_token_type_ids, + input_mask=flat_input_mask, attention_mask=flat_attention_mask, + mems=mems, perm_mask=perm_mask, target_mapping=target_mapping, + head_mask=head_mask) + + + output = transformer_outputs[0] + + output = self.sequence_summary(output) + logits = self.logits_proj(output) + reshaped_logits = logits.view(-1, num_choices) + outputs = (reshaped_logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if there are in it + + if labels is not None: + loss_fct = CrossEntropyLoss() + loss = loss_fct(reshaped_logits, labels.view(-1)) + outputs = (loss,) + outputs + + return outputs # return (loss), logits, mems, (hidden states), (attentions) + + +@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of + the hidden-states output to compute `span start logits` and `span end logits`). """, + XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING) +class XLNetForQuestionAnswering(XLNetPreTrainedModel): + r""" + **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the start of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the end of the labelled span for computing the token classification loss. + Positions are clamped to the length of the sequence (`sequence_length`). + Position outside of the sequence are not taken into account for computing the loss. + **is_impossible**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels whether a question has an answer or no answer (SQuAD 2.0) + **cls_index**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: + Labels for position (index) of the classification token to use as input for computing plausibility of the answer. 
+ **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``: + Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...). + 1.0 means token should be masked. 0.0 mean token is not masked. + + Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs: + **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``: + Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses. + **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)`` + Log probabilities for the top config.start_n_top start token possibilities (beam-search). + **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)`` + Indices for the top config.start_n_top start token possibilities (beam-search). + **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)`` + Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search). + **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)`` + Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search). + **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided) + ``torch.FloatTensor`` of shape ``(batch_size,)`` + Log probabilities for the ``is_impossible`` label of the answers. + **mems**: + list of ``torch.FloatTensor`` (one for each layer): + that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model + if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context. + See details in the docstring of the `mems` input above. + **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``) + list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings) + of shape ``(batch_size, sequence_length, hidden_size)``: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + **attentions**: (`optional`, returned when ``config.output_attentions=True``) + list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``: + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
+
+        Examples::
+
+            tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+            model = XLNetForQuestionAnswering.from_pretrained('xlnet-large-cased')
+            input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+            start_positions = torch.tensor([1])
+            end_positions = torch.tensor([3])
+            outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+            loss = outputs[0]  # when start/end positions are provided, the first output is the total loss
+
+    """
+    def __init__(self, config):
+        super(XLNetForQuestionAnswering, self).__init__(config)
+        self.start_n_top = config.start_n_top
+        self.end_n_top = config.end_n_top
+
+        self.transformer = XLNetModel(config)
+        self.start_logits = PoolerStartLogits(config)
+        self.end_logits = PoolerEndLogits(config)
+        self.answer_class = PoolerAnswerClass(config)
+
+        self.init_weights()
+
+    def forward(self, input_ids, attention_mask=None, mems=None, perm_mask=None, target_mapping=None,
+                token_type_ids=None, input_mask=None, head_mask=None,
+                start_positions=None, end_positions=None, is_impossible=None, cls_index=None, p_mask=None,):
+        transformer_outputs = self.transformer(input_ids,
+                                               attention_mask=attention_mask,
+                                               mems=mems,
+                                               perm_mask=perm_mask,
+                                               target_mapping=target_mapping,
+                                               token_type_ids=token_type_ids,
+                                               input_mask=input_mask,
+                                               head_mask=head_mask)
+        hidden_states = transformer_outputs[0]
+        start_logits = self.start_logits(hidden_states, p_mask=p_mask)
+
+        outputs = transformer_outputs[1:]  # Keep mems, hidden states and attentions if they are in it
+
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, let's remove the dimension added by batch splitting
+            for x in (start_positions, end_positions, cls_index, is_impossible):
+                if x is not None and x.dim() > 1:
+                    x.squeeze_(-1)
+
+            # during training, compute the end logits based on the ground truth of the start position
+            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)
+
+            loss_fct = CrossEntropyLoss()
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+
+            if cls_index is not None and is_impossible is not None:
+                # Predict answerability from the representation of CLS and START
+                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)
+                loss_fct_cls = nn.BCEWithLogitsLoss()
+                cls_loss = loss_fct_cls(cls_logits, is_impossible)
+
+                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss
+                total_loss += cls_loss * 0.5
+
+            outputs = (total_loss,) + outputs
+
+        else:
+            # during inference, compute the end logits based on beam search
+            bsz, slen, hsz = hidden_states.size()
+            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)
+
+            start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1)  # shape (bsz, start_n_top)
+            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)
+            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)
+            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)
+
+            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states)  # shape (bsz, slen, start_n_top, hsz)
+            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None
+            end_logits = self.end_logits(hidden_states_expanded,
start_states=start_states, p_mask=p_mask) + end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top) + + end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top) + end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top) + end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top) + + start_states = torch.einsum("blh,bl->bh", hidden_states, start_log_probs) # get the representation of START as weighted sum of hidden states + cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index) # Shape (batch size,): one single `cls_logits` for each sample + + outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs + + # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits + # or (if labels are provided) (total_loss,) + return outputs diff --git a/Optimus/code/pytorch_transformers/optimization.py b/Optimus/code/pytorch_transformers/optimization.py new file mode 100755 index 0000000000000000000000000000000000000000..39dc7a50ff10bdcb36cd09f0d73697d2f9510cab --- /dev/null +++ b/Optimus/code/pytorch_transformers/optimization.py @@ -0,0 +1,189 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch optimization for BERT model.""" + +import logging +import math + +import torch +from torch.optim import Optimizer +from torch.optim.lr_scheduler import LambdaLR + +logger = logging.getLogger(__name__) + +class ConstantLRSchedule(LambdaLR): + """ Constant learning rate schedule. + """ + def __init__(self, optimizer, last_epoch=-1): + super(ConstantLRSchedule, self).__init__(optimizer, lambda _: 1.0, last_epoch=last_epoch) + + +class WarmupConstantSchedule(LambdaLR): + """ Linear warmup and then constant. + Linearly increases learning rate schedule from 0 to 1 over `warmup_steps` training steps. + Keeps learning rate schedule equal to 1. after warmup_steps. + """ + def __init__(self, optimizer, warmup_steps, last_epoch=-1): + self.warmup_steps = warmup_steps + super(WarmupConstantSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch) + + def lr_lambda(self, step): + if step < self.warmup_steps: + return float(step) / float(max(1.0, self.warmup_steps)) + return 1. + + +class WarmupLinearSchedule(LambdaLR): + """ Linear warmup and then linear decay. + Linearly increases learning rate from 0 to 1 over `warmup_steps` training steps. + Linearly decreases learning rate from 1. to 0. over remaining `t_total - warmup_steps` steps. 
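+        For example (illustrative values), with ``warmup_steps=100`` and ``t_total=1000`` the multiplier
+        applied to the base learning rate is ``step / 100`` during the first 100 steps and
+        ``(1000 - step) / 900`` afterwards, reaching 0.0 at step 1000 (see ``lr_lambda`` below).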
+ """ + def __init__(self, optimizer, warmup_steps, t_total, last_epoch=-1): + self.warmup_steps = warmup_steps + self.t_total = t_total + super(WarmupLinearSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch) + + def lr_lambda(self, step): + if step < self.warmup_steps: + return float(step) / float(max(1, self.warmup_steps)) + return max(0.0, float(self.t_total - step) / float(max(1.0, self.t_total - self.warmup_steps))) + + +class WarmupCosineSchedule(LambdaLR): + """ Linear warmup and then cosine decay. + Linearly increases learning rate from 0 to 1 over `warmup_steps` training steps. + Decreases learning rate from 1. to 0. over remaining `t_total - warmup_steps` steps following a cosine curve. + If `cycles` (default=0.5) is different from default, learning rate follows cosine function after warmup. + """ + def __init__(self, optimizer, warmup_steps, t_total, cycles=.5, last_epoch=-1): + self.warmup_steps = warmup_steps + self.t_total = t_total + self.cycles = cycles + super(WarmupCosineSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch) + + def lr_lambda(self, step): + if step < self.warmup_steps: + return float(step) / float(max(1.0, self.warmup_steps)) + # progress after warmup + progress = float(step - self.warmup_steps) / float(max(1, self.t_total - self.warmup_steps)) + return max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + + +class WarmupCosineWithHardRestartsSchedule(LambdaLR): + """ Linear warmup and then cosine cycles with hard restarts. + Linearly increases learning rate from 0 to 1 over `warmup_steps` training steps. + If `cycles` (default=1.) is different from default, learning rate follows `cycles` times a cosine decaying + learning rate (with hard restarts). + """ + def __init__(self, optimizer, warmup_steps, t_total, cycles=1., last_epoch=-1): + self.warmup_steps = warmup_steps + self.t_total = t_total + self.cycles = cycles + super(WarmupCosineWithHardRestartsSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch) + + def lr_lambda(self, step): + if step < self.warmup_steps: + return float(step) / float(max(1, self.warmup_steps)) + # progress after warmup + progress = float(step - self.warmup_steps) / float(max(1, self.t_total - self.warmup_steps)) + if progress >= 1.0: + return 0.0 + return max(0.0, 0.5 * (1. + math.cos(math.pi * ((float(self.cycles) * progress) % 1.0)))) + + + +class AdamW(Optimizer): + """ Implements Adam algorithm with weight decay fix. + + Parameters: + lr (float): learning rate. Default 1e-3. + betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999) + eps (float): Adams epsilon. Default: 1e-6 + weight_decay (float): Weight decay. Default: 0.0 + correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True. 
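+
+    Example (an illustrative sketch; ``model``, ``train_dataloader``, ``compute_loss`` and the
+    hyper-parameter values are placeholders, not part of this module)::
+
+        optimizer = AdamW(model.parameters(), lr=5e-5, eps=1e-8, weight_decay=0.01)
+        scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)
+        for batch in train_dataloader:
+            loss = compute_loss(model, batch)  # task-specific forward pass
+            loss.backward()
+            optimizer.step()
+            scheduler.step()  # update the learning rate once per optimization step
+            optimizer.zero_grad()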
+ """ + def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True): + if lr < 0.0: + raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr)) + if not 0.0 <= betas[0] < 1.0: + raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[0])) + if not 0.0 <= betas[1] < 1.0: + raise ValueError("Invalid beta parameter: {} - should be in [0.0, 1.0[".format(betas[1])) + if not 0.0 <= eps: + raise ValueError("Invalid epsilon value: {} - should be >= 0.0".format(eps)) + defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, + correct_bias=correct_bias) + super(AdamW, self).__init__(params, defaults) + + def step(self, closure=None): + """Performs a single optimization step. + + Arguments: + closure (callable, optional): A closure that reevaluates the model + and returns the loss. + """ + loss = None + if closure is not None: + loss = closure() + + for group in self.param_groups: + for p in group['params']: + if p.grad is None: + continue + grad = p.grad.data + if grad.is_sparse: + raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead') + + state = self.state[p] + + # State initialization + if len(state) == 0: + state['step'] = 0 + # Exponential moving average of gradient values + state['exp_avg'] = torch.zeros_like(p.data) + # Exponential moving average of squared gradient values + state['exp_avg_sq'] = torch.zeros_like(p.data) + + exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] + beta1, beta2 = group['betas'] + + state['step'] += 1 + + # Decay the first and second moment running average coefficient + # In-place operations to update the averages at the same time + exp_avg.mul_(beta1).add_(1.0 - beta1, grad) + exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad) + denom = exp_avg_sq.sqrt().add_(group['eps']) + + step_size = group['lr'] + if group['correct_bias']: # No bias correction for Bert + bias_correction1 = 1.0 - beta1 ** state['step'] + bias_correction2 = 1.0 - beta2 ** state['step'] + step_size = step_size * math.sqrt(bias_correction2) / bias_correction1 + + p.data.addcdiv_(-step_size, exp_avg, denom) + + # Just adding the square of the weights to the loss function is *not* + # the correct way of using L2 regularization/weight decay with Adam, + # since that will interact with the m and v parameters in strange ways. + # + # Instead we want to decay the weights in a manner that doesn't interact + # with the m/v parameters. This is equivalent to adding the square + # of the weights to the loss with plain (non-momentum) SGD. + # Add weight decay at the end (fixed version) + if group['weight_decay'] > 0.0: + p.data.add_(-group['lr'] * group['weight_decay'], p.data) + + return loss diff --git a/Optimus/code/pytorch_transformers/tests/__init__.py b/Optimus/code/pytorch_transformers/tests/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/Optimus/code/pytorch_transformers/tests/configuration_common_test.py b/Optimus/code/pytorch_transformers/tests/configuration_common_test.py new file mode 100755 index 0000000000000000000000000000000000000000..8ee751153c1721bdfe842278b2d382812cd2c542 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/configuration_common_test.py @@ -0,0 +1,63 @@ +# coding=utf-8 +# Copyright 2019 HuggingFace Inc. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import copy +import os +import shutil +import json +import random +import uuid + +import unittest +import logging + + +class ConfigTester(object): + def __init__(self, parent, config_class=None, **kwargs): + self.parent = parent + self.config_class = config_class + self.inputs_dict = kwargs + + def create_and_test_config_common_properties(self): + config = self.config_class(**self.inputs_dict) + self.parent.assertTrue(hasattr(config, 'vocab_size')) + self.parent.assertTrue(hasattr(config, 'hidden_size')) + self.parent.assertTrue(hasattr(config, 'num_attention_heads')) + self.parent.assertTrue(hasattr(config, 'num_hidden_layers')) + + def create_and_test_config_to_json_string(self): + config = self.config_class(**self.inputs_dict) + obj = json.loads(config.to_json_string()) + for key, value in self.inputs_dict.items(): + self.parent.assertEqual(obj[key], value) + + def create_and_test_config_to_json_file(self): + config_first = self.config_class(**self.inputs_dict) + json_file_path = os.path.join(os.getcwd(), "config_" + str(uuid.uuid4()) + ".json") + config_first.to_json_file(json_file_path) + config_second = self.config_class.from_json_file(json_file_path) + os.remove(json_file_path) + self.parent.assertEqual(config_second.to_dict(), config_first.to_dict()) + + def run_common_tests(self): + self.create_and_test_config_common_properties() + self.create_and_test_config_to_json_string() + self.create_and_test_config_to_json_file() + +if __name__ == "__main__": + unittest.main() \ No newline at end of file diff --git a/Optimus/code/pytorch_transformers/tests/conftest.py b/Optimus/code/pytorch_transformers/tests/conftest.py new file mode 100755 index 0000000000000000000000000000000000000000..841ebc8df9ee11b40fa2b8dc7fbe5e9004fcee70 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/conftest.py @@ -0,0 +1,19 @@ +# content of conftest.py + +import pytest + + +def pytest_addoption(parser): + parser.addoption( + "--runslow", action="store_true", default=False, help="run slow tests" + ) + + +def pytest_collection_modifyitems(config, items): + if config.getoption("--runslow"): + # --runslow given in cli: do not skip slow tests + return + skip_slow = pytest.mark.skip(reason="need --runslow option to run") + for item in items: + if "slow" in item.keywords: + item.add_marker(skip_slow) diff --git a/Optimus/code/pytorch_transformers/tests/fixtures/input.txt b/Optimus/code/pytorch_transformers/tests/fixtures/input.txt new file mode 100755 index 0000000000000000000000000000000000000000..d1e3f410d07833e4c5c233ffd54f8d2b54ebb7cf --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/fixtures/input.txt @@ -0,0 +1 @@ +Who was Jim Henson ? 
||| Jim Henson was a puppeteer diff --git a/Optimus/code/pytorch_transformers/tests/fixtures/sample_text.txt b/Optimus/code/pytorch_transformers/tests/fixtures/sample_text.txt new file mode 100755 index 0000000000000000000000000000000000000000..a42812060c576bae870eb29b1ac083fda0d239d3 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/fixtures/sample_text.txt @@ -0,0 +1,33 @@ +This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত +Text should be one-sentence-per-line, with empty lines between documents. +This sample text is public domain and was randomly selected from Project Guttenberg. + +The rain had only ceased with the gray streaks of morning at Blazing Star, and the settlement awoke to a moral sense of cleanliness, and the finding of forgotten knives, tin cups, and smaller camp utensils, where the heavy showers had washed away the debris and dust heaps before the cabin doors. +Indeed, it was recorded in Blazing Star that a fortunate early riser had once picked up on the highway a solid chunk of gold quartz which the rain had freed from its incumbering soil, and washed into immediate and glittering popularity. +Possibly this may have been the reason why early risers in that locality, during the rainy season, adopted a thoughtful habit of body, and seldom lifted their eyes to the rifted or india-ink washed skies above them. +"Cass" Beard had risen early that morning, but not with a view to discovery. +A leak in his cabin roof,--quite consistent with his careless, improvident habits,--had roused him at 4 A. M., with a flooded "bunk" and wet blankets. +The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor's to supply the deficiency. +This was nearly opposite. +Mr. Cassius crossed the highway, and stopped suddenly. +Something glittered in the nearest red pool before him. +Gold, surely! +But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring. +Looking at it more attentively, he saw that it bore the inscription, "May to Cass." +Like most of his fellow gold-seekers, Cass was superstitious. + +The fountain of classic wisdom, Hypatia herself. +As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge. +From my youth I felt in me a soul above the matter-entangled herd. +She revealed to me the glorious fact, that I am a spark of Divinity itself. +A fallen star, I am, sir!' continued he, pensively, stroking his lean stomach--'a fallen star!--fallen, if the dignity of philosophy will allow of the simile, among the hogs of the lower world--indeed, even into the hog-bucket itself. Well, after all, I will show you the way to the Archbishop's. +There is a philosophic pleasure in opening one's treasures to the modest young. +Perhaps you will assist me by carrying this basket of fruit?' And the little man jumped up, put his basket on Philammon's head, and trotted off up a neighbouring street. 
+Philammon followed, half contemptuous, half wondering at what this philosophy might be, which could feed the self-conceit of anything so abject as his ragged little apish guide; +but the novel roar and whirl of the street, the perpetual stream of busy faces, the line of curricles, palanquins, laden asses, camels, elephants, which met and passed him, and squeezed him up steps and into doorways, as they threaded their way through the great Moon-gate into the ample street beyond, drove everything from his mind but wondering curiosity, and a vague, helpless dread of that great living wilderness, more terrible than any dead wilderness of sand which he had left behind. +Already he longed for the repose, the silence of the Laura--for faces which knew him and smiled upon him; but it was too late to turn back now. +His guide held on for more than a mile up the great main street, crossed in the centre of the city, at right angles, by one equally magnificent, at each end of which, miles away, appeared, dim and distant over the heads of the living stream of passengers, the yellow sand-hills of the desert; +while at the end of the vista in front of them gleamed the blue harbour, through a network of countless masts. +At last they reached the quay at the opposite end of the street; +and there burst on Philammon's astonished eyes a vast semicircle of blue sea, ringed with palaces and towers. +He stopped involuntarily; and his little guide stopped also, and looked askance at the young monk, to watch the effect which that grand panorama should produce on him. diff --git a/Optimus/code/pytorch_transformers/tests/fixtures/test_sentencepiece.model b/Optimus/code/pytorch_transformers/tests/fixtures/test_sentencepiece.model new file mode 100755 index 0000000000000000000000000000000000000000..c93fabdc0d8840e28baff407ec1a048eff8abc23 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/fixtures/test_sentencepiece.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8dfd1eae4522281b1b839eab877a791befec7a1663a41c814c77d9c89c748f2d +size 253154 diff --git a/Optimus/code/pytorch_transformers/tests/modeling_auto_test.py b/Optimus/code/pytorch_transformers/tests/modeling_auto_test.py new file mode 100755 index 0000000000000000000000000000000000000000..dfdedbbe6129ba5141d9de777740f93fd62416df --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_auto_test.py @@ -0,0 +1,88 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import shutil +import pytest +import logging + +from pytorch_transformers import (AutoConfig, BertConfig, + AutoModel, BertModel, + AutoModelWithLMHead, BertForMaskedLM, + AutoModelForSequenceClassification, BertForSequenceClassification, + AutoModelForQuestionAnswering, BertForQuestionAnswering) +from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class AutoModelTest(unittest.TestCase): + def test_model_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + config = AutoConfig.from_pretrained(model_name) + self.assertIsNotNone(config) + self.assertIsInstance(config, BertConfig) + + model = AutoModel.from_pretrained(model_name) + model, loading_info = AutoModel.from_pretrained(model_name, output_loading_info=True) + self.assertIsNotNone(model) + self.assertIsInstance(model, BertModel) + for value in loading_info.values(): + self.assertEqual(len(value), 0) + + def test_lmhead_model_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + config = AutoConfig.from_pretrained(model_name) + self.assertIsNotNone(config) + self.assertIsInstance(config, BertConfig) + + model = AutoModelWithLMHead.from_pretrained(model_name) + model, loading_info = AutoModelWithLMHead.from_pretrained(model_name, output_loading_info=True) + self.assertIsNotNone(model) + self.assertIsInstance(model, BertForMaskedLM) + + def test_sequence_classification_model_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + config = AutoConfig.from_pretrained(model_name) + self.assertIsNotNone(config) + self.assertIsInstance(config, BertConfig) + + model = AutoModelForSequenceClassification.from_pretrained(model_name) + model, loading_info = AutoModelForSequenceClassification.from_pretrained(model_name, output_loading_info=True) + self.assertIsNotNone(model) + self.assertIsInstance(model, BertForSequenceClassification) + + def test_question_answering_model_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + config = AutoConfig.from_pretrained(model_name) + self.assertIsNotNone(config) + self.assertIsInstance(config, BertConfig) + + model = AutoModelForQuestionAnswering.from_pretrained(model_name) + model, loading_info = AutoModelForQuestionAnswering.from_pretrained(model_name, output_loading_info=True) + self.assertIsNotNone(model) + self.assertIsInstance(model, BertForQuestionAnswering) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_bert_test.py b/Optimus/code/pytorch_transformers/tests/modeling_bert_test.py new file mode 100755 index 0000000000000000000000000000000000000000..2919cc033682a2b0b9957d506350b1dade88a85b --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_bert_test.py @@ -0,0 +1,315 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import shutil +import pytest + +from pytorch_transformers import (BertConfig, BertModel, BertForMaskedLM, + BertForNextSentencePrediction, BertForPreTraining, + BertForQuestionAnswering, BertForSequenceClassification, + BertForTokenClassification, BertForMultipleChoice) +from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class BertModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (BertModel, BertForMaskedLM, BertForNextSentencePrediction, + BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, + BertForTokenClassification) + + class BertModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_mask=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + + token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = BertConfig( + vocab_size_or_config_json_file=self.vocab_size, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + 
num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + type_vocab_size=self.type_vocab_size, + initializer_range=self.initializer_range) + + return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_bert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = BertModel(config=config) + model.eval() + sequence_output, pooled_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids) + sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids) + sequence_output, pooled_output = model(input_ids) + + result = { + "sequence_output": sequence_output, + "pooled_output": pooled_output, + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size]) + + + def create_and_check_bert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = BertForMaskedLM(config=config) + model.eval() + loss, prediction_scores = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels) + result = { + "loss": loss, + "prediction_scores": prediction_scores, + } + self.parent.assertListEqual( + list(result["prediction_scores"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.check_loss_output(result) + + def create_and_check_bert_for_next_sequence_prediction(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = BertForNextSentencePrediction(config=config) + model.eval() + loss, seq_relationship_score = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, next_sentence_label=sequence_labels) + result = { + "loss": loss, + "seq_relationship_score": seq_relationship_score, + } + self.parent.assertListEqual( + list(result["seq_relationship_score"].size()), + [self.batch_size, 2]) + self.check_loss_output(result) + + + def create_and_check_bert_for_pretraining(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = BertForPreTraining(config=config) + model.eval() + loss, prediction_scores, seq_relationship_score = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, + masked_lm_labels=token_labels, next_sentence_label=sequence_labels) + result = { + "loss": loss, + "prediction_scores": prediction_scores, + "seq_relationship_score": seq_relationship_score, + } + self.parent.assertListEqual( + list(result["prediction_scores"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.parent.assertListEqual( + list(result["seq_relationship_score"].size()), + [self.batch_size, 2]) + self.check_loss_output(result) + + + def create_and_check_bert_for_question_answering(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = BertForQuestionAnswering(config=config) + model.eval() + loss, start_logits, 
end_logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, + start_positions=sequence_labels, end_positions=sequence_labels) + result = { + "loss": loss, + "start_logits": start_logits, + "end_logits": end_logits, + } + self.parent.assertListEqual( + list(result["start_logits"].size()), + [self.batch_size, self.seq_length]) + self.parent.assertListEqual( + list(result["end_logits"].size()), + [self.batch_size, self.seq_length]) + self.check_loss_output(result) + + + def create_and_check_bert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + config.num_labels = self.num_labels + model = BertForSequenceClassification(config) + model.eval() + loss, logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels) + result = { + "loss": loss, + "logits": logits, + } + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.num_labels]) + self.check_loss_output(result) + + + def create_and_check_bert_for_token_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + config.num_labels = self.num_labels + model = BertForTokenClassification(config=config) + model.eval() + loss, logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels) + result = { + "loss": loss, + "logits": logits, + } + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.seq_length, self.num_labels]) + self.check_loss_output(result) + + + def create_and_check_bert_for_multiple_choice(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels): + config.num_choices = self.num_choices + model = BertForMultipleChoice(config=config) + model.eval() + multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous() + multiple_choice_token_type_ids = token_type_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous() + multiple_choice_input_mask = input_mask.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous() + loss, logits = model(multiple_choice_inputs_ids, + attention_mask=multiple_choice_input_mask, + token_type_ids=multiple_choice_token_type_ids, + labels=choice_labels) + result = { + "loss": loss, + "logits": logits, + } + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.num_choices]) + self.check_loss_output(result) + + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, token_type_ids, input_mask, + sequence_labels, token_labels, choice_labels) = config_and_inputs + inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask} + return config, inputs_dict + + def setUp(self): + self.model_tester = BertModelTest.BertModelTester(self) + self.config_tester = ConfigTester(self, config_class=BertConfig, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_bert_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_model(*config_and_inputs) + + def test_for_masked_lm(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_masked_lm(*config_and_inputs) + + def test_for_multiple_choice(self): + 
config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_multiple_choice(*config_and_inputs) + + def test_for_next_sequence_prediction(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_next_sequence_prediction(*config_and_inputs) + + def test_for_pretraining(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_pretraining(*config_and_inputs) + + def test_for_question_answering(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_question_answering(*config_and_inputs) + + def test_for_sequence_classification(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_sequence_classification(*config_and_inputs) + + def test_for_token_classification(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_bert_for_token_classification(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = BertModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_common_test.py b/Optimus/code/pytorch_transformers/tests/modeling_common_test.py new file mode 100755 index 0000000000000000000000000000000000000000..c6194fefcce30783806bbe11fc7dcc28baa45208 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_common_test.py @@ -0,0 +1,711 @@ +# coding=utf-8 +# Copyright 2019 HuggingFace Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
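+# Shared test harness for the per-model test files. CommonModelTester covers the
+# behaviour every registered model class should share: parameter initialization,
+# attention and hidden-state outputs, TorchScript tracing, head masking and pruning,
+# token-embedding resizing and weight tying. ConfigTester round-trips configurations
+# through JSON, and ids_tensor builds the random integer inputs used by the testers.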
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import copy +import os +import shutil +import json +import random +import uuid + +import unittest +import logging + +import torch + +from pytorch_transformers import (PretrainedConfig, PreTrainedModel, + BertModel, BertConfig, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, + GPT2LMHeadModel, GPT2Config, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP) + + +def _config_zero_init(config): + configs_no_init = copy.deepcopy(config) + for key in configs_no_init.__dict__.keys(): + if '_range' in key or '_std' in key: + setattr(configs_no_init, key, 0.0) + return configs_no_init + +class CommonTestCases: + + class CommonModelTester(unittest.TestCase): + + model_tester = None + all_model_classes = () + test_torchscript = True + test_pruning = True + test_resize_embeddings = True + test_head_masking = True + + def test_initialization(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + configs_no_init = _config_zero_init(config) + for model_class in self.all_model_classes: + model = model_class(config=configs_no_init) + for name, param in model.named_parameters(): + if param.requires_grad: + self.assertIn(param.data.mean().item(), [0.0, 1.0], + msg="Parameter {} of model {} seems not properly initialized".format(name, model_class)) + + def test_attention_outputs(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + config.output_attentions = True + config.output_hidden_states = False + model = model_class(config) + model.eval() + outputs = model(**inputs_dict) + attentions = outputs[-1] + self.assertEqual(model.config.output_attentions, True) + self.assertEqual(model.config.output_hidden_states, False) + self.assertEqual(len(attentions), self.model_tester.num_hidden_layers) + self.assertListEqual( + list(attentions[0].shape[-3:]), + [self.model_tester.num_attention_heads, + self.model_tester.seq_length, + self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length]) + out_len = len(outputs) + + # Check attention is always last and order is fine + config.output_attentions = True + config.output_hidden_states = True + model = model_class(config) + model.eval() + outputs = model(**inputs_dict) + self.assertEqual(out_len+1, len(outputs)) + self.assertEqual(model.config.output_attentions, True) + self.assertEqual(model.config.output_hidden_states, True) + + attentions = outputs[-1] + self.assertEqual(len(attentions), self.model_tester.num_hidden_layers) + self.assertListEqual( + list(attentions[0].shape[-3:]), + [self.model_tester.num_attention_heads, + self.model_tester.seq_length, + self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length]) + + def test_torchscript(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + self._create_and_check_torchscript(config, inputs_dict) + + def test_torchscript_output_attentions(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + config.output_attentions = True + self._create_and_check_torchscript(config, inputs_dict) + + def test_torchscript_output_hidden_state(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + config.output_hidden_states = True + self._create_and_check_torchscript(config, inputs_dict) + + def _create_and_check_torchscript(self, config, 
inputs_dict): + if not self.test_torchscript: + return + + configs_no_init = _config_zero_init(config) # To be sure we have no Nan + configs_no_init.torchscript = True + for model_class in self.all_model_classes: + model = model_class(config=configs_no_init) + model.eval() + inputs = inputs_dict['input_ids'] # Let's keep only input_ids + + try: + torch.jit.trace(model, inputs) + except RuntimeError: + self.fail("Couldn't trace module.") + + try: + traced_gpt2 = torch.jit.trace(model, inputs) + torch.jit.save(traced_gpt2, "traced_model.pt") + except RuntimeError: + self.fail("Couldn't save module.") + + try: + loaded_model = torch.jit.load("traced_model.pt") + os.remove("traced_model.pt") + except ValueError: + self.fail("Couldn't load module.") + + model.eval() + loaded_model.eval() + + model_params = model.parameters() + loaded_model_params = loaded_model.parameters() + + models_equal = True + for p1, p2 in zip(model_params, loaded_model_params): + if p1.data.ne(p2.data).sum() > 0: + models_equal = False + + self.assertTrue(models_equal) + + + def test_headmasking(self): + if not self.test_head_masking: + return + + global_rng.seed(42) + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + global_rng.seed() + + config.output_attentions = True + config.output_hidden_states = True + configs_no_init = _config_zero_init(config) # To be sure we have no Nan + for model_class in self.all_model_classes: + model = model_class(config=configs_no_init) + model.eval() + + # Prepare head_mask + # Set require_grad after having prepared the tensor to avoid error (leaf variable has been moved into the graph interior) + head_mask = torch.ones(self.model_tester.num_hidden_layers, self.model_tester.num_attention_heads) + head_mask[0, 0] = 0 + head_mask[-1, :-1] = 0 + head_mask.requires_grad_(requires_grad=True) + inputs = inputs_dict.copy() + inputs['head_mask'] = head_mask + + outputs = model(**inputs) + + # Test that we can get a gradient back for importance score computation + output = sum(t.sum() for t in outputs[0]) + output = output.sum() + output.backward() + multihead_outputs = head_mask.grad + + attentions = outputs[-1] + hidden_states = outputs[-2] + + # Remove Nan + + self.assertIsNotNone(multihead_outputs) + self.assertEqual(len(multihead_outputs), self.model_tester.num_hidden_layers) + self.assertAlmostEqual( + attentions[0][..., 0, :, :].flatten().sum().item(), 0.0) + self.assertNotEqual( + attentions[0][..., -1, :, :].flatten().sum().item(), 0.0) + self.assertNotEqual( + attentions[1][..., 0, :, :].flatten().sum().item(), 0.0) + self.assertAlmostEqual( + attentions[-1][..., -2, :, :].flatten().sum().item(), 0.0) + self.assertNotEqual( + attentions[-1][..., -1, :, :].flatten().sum().item(), 0.0) + + + def test_head_pruning(self): + if not self.test_pruning: + return + + for model_class in self.all_model_classes: + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + if "head_mask" in inputs_dict: + del inputs_dict["head_mask"] + + config.output_attentions = True + config.output_hidden_states = False + model = model_class(config=config) + model.eval() + heads_to_prune = {0: list(range(1, self.model_tester.num_attention_heads)), + -1: [0]} + model.prune_heads(heads_to_prune) + outputs = model(**inputs_dict) + + attentions = outputs[-1] + + self.assertEqual( + attentions[0].shape[-3], 1) + self.assertEqual( + attentions[1].shape[-3], self.model_tester.num_attention_heads) + self.assertEqual( + attentions[-1].shape[-3], 
self.model_tester.num_attention_heads - 1) + + def test_head_pruning_save_load_from_pretrained(self): + if not self.test_pruning: + return + + for model_class in self.all_model_classes: + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + if "head_mask" in inputs_dict: + del inputs_dict["head_mask"] + + config.output_attentions = True + config.output_hidden_states = False + model = model_class(config=config) + model.eval() + heads_to_prune = {0: list(range(1, self.model_tester.num_attention_heads)), + -1: [0]} + model.prune_heads(heads_to_prune) + directory = "pruned_model" + if not os.path.exists(directory): + os.makedirs(directory) + model.save_pretrained(directory) + model = model_class.from_pretrained(directory) + + outputs = model(**inputs_dict) + attentions = outputs[-1] + self.assertEqual(attentions[0].shape[-3], 1) + self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads) + self.assertEqual(attentions[-1].shape[-3], self.model_tester.num_attention_heads - 1) + + shutil.rmtree(directory) + + def test_head_pruning_save_load_from_config_init(self): + if not self.test_pruning: + return + + for model_class in self.all_model_classes: + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + if "head_mask" in inputs_dict: + del inputs_dict["head_mask"] + + config.output_attentions = True + config.output_hidden_states = False + + heads_to_prune = {0: list(range(1, self.model_tester.num_attention_heads)), + -1: [0]} + config.pruned_heads = heads_to_prune + + model = model_class(config=config) + model.eval() + + outputs = model(**inputs_dict) + attentions = outputs[-1] + + self.assertEqual(attentions[0].shape[-3], 1) + self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads) + self.assertEqual(attentions[-1].shape[-3], self.model_tester.num_attention_heads - 1) + + def test_head_pruning_integration(self): + if not self.test_pruning: + return + + for model_class in self.all_model_classes: + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + if "head_mask" in inputs_dict: + del inputs_dict["head_mask"] + + config.output_attentions = True + config.output_hidden_states = False + + heads_to_prune = {0: [0], 1: [1, 2]} + config.pruned_heads = heads_to_prune + + model = model_class(config=config) + model.eval() + + outputs = model(**inputs_dict) + attentions = outputs[-1] + + self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads - 1) + self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads - 2) + self.assertEqual(attentions[2].shape[-3], self.model_tester.num_attention_heads) + self.assertEqual(attentions[3].shape[-3], self.model_tester.num_attention_heads) + + directory = "pruned_model" + + if not os.path.exists(directory): + os.makedirs(directory) + model.save_pretrained(directory) + model = model_class.from_pretrained(directory) + shutil.rmtree(directory) + + outputs = model(**inputs_dict) + attentions = outputs[-1] + + self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads - 1) + self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads - 2) + self.assertEqual(attentions[2].shape[-3], self.model_tester.num_attention_heads) + self.assertEqual(attentions[3].shape[-3], self.model_tester.num_attention_heads) + + heads_to_prune = {0: [0], 2: [1, 2]} + model.prune_heads(heads_to_prune) + + outputs = model(**inputs_dict) + attentions = outputs[-1] + + 
self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads -1) + self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads - 2) + self.assertEqual(attentions[2].shape[-3], self.model_tester.num_attention_heads - 2) + self.assertEqual(attentions[3].shape[-3], self.model_tester.num_attention_heads) + + self.assertDictEqual(model.config.pruned_heads, {0: [0], 1: [1, 2], 2: [1, 2]}) + + + def test_hidden_states_output(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + config.output_hidden_states = True + config.output_attentions = False + model = model_class(config) + model.eval() + outputs = model(**inputs_dict) + hidden_states = outputs[-1] + self.assertEqual(model.config.output_attentions, False) + self.assertEqual(model.config.output_hidden_states, True) + self.assertEqual(len(hidden_states), self.model_tester.num_hidden_layers + 1) + self.assertListEqual( + list(hidden_states[0].shape[-2:]), + [self.model_tester.seq_length, self.model_tester.hidden_size]) + + def test_resize_tokens_embeddings(self): + original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + if not self.test_resize_embeddings: + return + + for model_class in self.all_model_classes: + config = copy.deepcopy(original_config) + model = model_class(config) + + model_vocab_size = config.vocab_size + # Retrieve the embeddings and clone theme + model_embed = model.resize_token_embeddings(model_vocab_size) + cloned_embeddings = model_embed.weight.clone() + + # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size + model_embed = model.resize_token_embeddings(model_vocab_size + 10) + self.assertEqual(model.config.vocab_size, model_vocab_size + 10) + # Check that it actually resizes the embeddings matrix + self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] + 10) + + # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size + model_embed = model.resize_token_embeddings(model_vocab_size - 15) + self.assertEqual(model.config.vocab_size, model_vocab_size - 15) + # Check that it actually resizes the embeddings matrix + self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] - 15) + + # Check that adding and removing tokens has not modified the first part of the embedding matrix. 
+ models_equal = True + for p1, p2 in zip(cloned_embeddings, model_embed.weight): + if p1.data.ne(p2.data).sum() > 0: + models_equal = False + + self.assertTrue(models_equal) + + def test_tie_model_weights(self): + if not self.test_torchscript: + return + + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + def check_same_values(layer_1, layer_2): + equal = True + for p1, p2 in zip(layer_1.weight, layer_2.weight): + if p1.data.ne(p2.data).sum() > 0: + equal = False + return equal + + for model_class in self.all_model_classes: + if not hasattr(model_class, 'tie_weights'): + continue + + config.torchscript = True + model_not_tied = model_class(config) + params_not_tied = list(model_not_tied.parameters()) + + config_tied = copy.deepcopy(config) + config_tied.torchscript = False + model_tied = model_class(config_tied) + params_tied = list(model_tied.parameters()) + + # Check that the embedding layer and decoding layer are the same in size and in value + self.assertGreater(len(params_not_tied), len(params_tied)) + # self.assertTrue(check_same_values(embeddings, decoding)) + + # # Check that after modification, they remain the same. + # embeddings.weight.data.div_(2) + # # Check that the embedding layer and decoding layer are the same in size and in value + # self.assertTrue(embeddings.weight.shape, decoding.weight.shape) + # self.assertTrue(check_same_values(embeddings, decoding)) + + # # Check that after modification, they remain the same. + # decoding.weight.data.div_(4) + # # Check that the embedding layer and decoding layer are the same in size and in value + # self.assertTrue(embeddings.weight.shape, decoding.weight.shape) + # self.assertTrue(check_same_values(embeddings, decoding)) + + # Check that after resize they remain tied. 
+ model_tied.resize_token_embeddings(config.vocab_size + 10) + params_tied_2 = list(model_tied.parameters()) + self.assertGreater(len(params_not_tied), len(params_tied)) + self.assertEqual(len(params_tied_2), len(params_tied)) + + # decoding.weight.data.mul_(20) + # # Check that the embedding layer and decoding layer are the same in size and in value + # self.assertTrue(model.transformer.wte.weight.shape, model.lm_head.weight.shape) + # self.assertTrue(check_same_values(model.transformer.wte, model.lm_head)) + + + class GPTModelTester(CommonModelTester): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_position_ids=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + n_positions=33, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + n_choices=3, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + scope=None, + config_class=None, + base_model_class=None, + lm_head_model_class=None, + double_head_model_class=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_position_ids = use_position_ids + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.n_positions = n_positions + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.n_choices = n_choices + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.scope = scope + self.config_class = config_class + self.base_model_class = base_model_class + self.lm_head_model_class = lm_head_model_class + self.double_head_model_class = double_head_model_class + self.all_model_classes = (base_model_class, lm_head_model_class, double_head_model_class) + + def prepare_config_and_inputs(self): + total_num_tokens = self.vocab_size + input_ids = ids_tensor([self.batch_size, self.n_choices, self.seq_length], total_num_tokens) + + position_ids = None + if self.use_position_ids: + position_ids = ids_tensor([self.batch_size, self.n_choices, self.seq_length], self.n_positions) + + token_type_ids = None + if self.use_token_type_ids: + total_voc = self.vocab_size + token_type_ids = ids_tensor([self.batch_size, self.n_choices, self.seq_length], total_voc) + + mc_labels = None + lm_labels = None + mc_token_ids = None + if self.use_labels: + mc_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + lm_labels = ids_tensor([self.batch_size, self.n_choices, self.seq_length], self.num_labels) + mc_token_ids = ids_tensor([self.batch_size, self.n_choices], self.seq_length) + + config = self.config_class( + vocab_size_or_config_json_file=self.vocab_size, + n_positions=self.n_positions, + n_embd=self.hidden_size, + n_layer=self.num_hidden_layers, + n_head=self.num_attention_heads, + initializer_range=self.initializer_range) + + return (config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids) + + def create_and_check_base_model(self, config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids): + model = self.base_model_class(config) + model.eval() + + outputs = model(input_ids, position_ids, token_type_ids) + outputs = model(input_ids, position_ids) + outputs = model(input_ids) + + hidden_state = outputs[0] + self.parent.assertListEqual( + list(hidden_state.size()), + [self.batch_size, self.n_choices, 
self.seq_length, self.hidden_size]) + + + def create_and_check_lm_head(self, config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids): + model = self.lm_head_model_class(config) + model.eval() + outputs = model(input_ids, position_ids, token_type_ids, lm_labels) + loss, lm_logits = outputs[:2] + + total_voc = self.vocab_size + self.parent.assertListEqual( + list(lm_logits.size()), + [self.batch_size, self.n_choices, self.seq_length, total_voc]) + self.parent.assertListEqual( + list(loss.size()), + []) + + def create_and_check_presents(self, config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids): + for model_class in self.all_model_classes: + model = model_class(config) + model.eval() + outputs = model(input_ids) + presents = outputs[-1] + self.parent.assertEqual(self.num_hidden_layers, len(presents)) + self.parent.assertListEqual( + list(presents[0].size()), + [2, self.batch_size * self.n_choices, self.num_attention_heads, + self.seq_length, self.hidden_size // self.num_attention_heads]) + + def create_and_check_double_heads(self, config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids): + model = self.double_head_model_class(config) + model.eval() + outputs = model(input_ids, mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels, + token_type_ids=token_type_ids, position_ids=position_ids) + lm_loss, mc_loss, lm_logits, mc_logits = outputs[:4] + loss = [lm_loss, mc_loss] + + total_voc = self.vocab_size + self.parent.assertListEqual( + list(lm_logits.size()), + [self.batch_size, self.n_choices, self.seq_length, total_voc]) + self.parent.assertListEqual( + list(mc_logits.size()), + [self.batch_size, self.n_choices]) + self.parent.assertListEqual( + [list(l.size()) for l in loss], + [[], []]) + + def create_and_check_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(self.base_model_class.pretrained_model_archive_map.keys())[:1]: + model = self.base_model_class.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.parent.assertIsNotNone(model) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, token_type_ids, position_ids, + mc_labels, lm_labels, mc_token_ids) = config_and_inputs + inputs_dict = {'input_ids': input_ids} + return config, inputs_dict + + def run_common_tests(self, test_presents=False): + config_and_inputs = self.prepare_config_and_inputs() + self.create_and_check_base_model(*config_and_inputs) + + config_and_inputs = self.prepare_config_and_inputs() + self.create_and_check_lm_head(*config_and_inputs) + + config_and_inputs = self.prepare_config_and_inputs() + self.create_and_check_double_heads(*config_and_inputs) + + if test_presents: + config_and_inputs = self.prepare_config_and_inputs() + self.create_and_check_presents(*config_and_inputs) + + def run_slow_tests(self): + self.create_and_check_model_from_pretrained() + + +class ConfigTester(object): + def __init__(self, parent, config_class=None, **kwargs): + self.parent = parent + self.config_class = config_class + self.inputs_dict = kwargs + + def create_and_test_config_common_properties(self): + config = self.config_class(**self.inputs_dict) + self.parent.assertTrue(hasattr(config, 'vocab_size')) + self.parent.assertTrue(hasattr(config, 'hidden_size')) + self.parent.assertTrue(hasattr(config, 'num_attention_heads')) + self.parent.assertTrue(hasattr(config, 'num_hidden_layers')) + 
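+ # Round-trip checks: serialize the config to a JSON string and to a JSON file,
+ # reparse it, and compare the result field by field with the original kwargs.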
+ def create_and_test_config_to_json_string(self): + config = self.config_class(**self.inputs_dict) + obj = json.loads(config.to_json_string()) + for key, value in self.inputs_dict.items(): + self.parent.assertEqual(obj[key], value) + + def create_and_test_config_to_json_file(self): + config_first = self.config_class(**self.inputs_dict) + json_file_path = os.path.join(os.getcwd(), "config_" + str(uuid.uuid4()) + ".json") + config_first.to_json_file(json_file_path) + config_second = self.config_class.from_json_file(json_file_path) + os.remove(json_file_path) + self.parent.assertEqual(config_second.to_dict(), config_first.to_dict()) + + def run_common_tests(self): + self.create_and_test_config_common_properties() + self.create_and_test_config_to_json_string() + self.create_and_test_config_to_json_file() + + +global_rng = random.Random() + + +def ids_tensor(shape, vocab_size, rng=None, name=None): + """Creates a random int32 tensor of the shape within the vocab size.""" + if rng is None: + rng = global_rng + + total_dims = 1 + for dim in shape: + total_dims *= dim + + values = [] + for _ in range(total_dims): + values.append(rng.randint(0, vocab_size - 1)) + + return torch.tensor(data=values, dtype=torch.long).view(shape).contiguous() + + +class ModelUtilsTest(unittest.TestCase): + def test_model_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + config = BertConfig.from_pretrained(model_name) + self.assertIsNotNone(config) + self.assertIsInstance(config, PretrainedConfig) + + model = BertModel.from_pretrained(model_name) + model, loading_info = BertModel.from_pretrained(model_name, output_loading_info=True) + self.assertIsNotNone(model) + self.assertIsInstance(model, PreTrainedModel) + for value in loading_info.values(): + self.assertEqual(len(value), 0) + + config = BertConfig.from_pretrained(model_name, output_attentions=True, output_hidden_states=True) + model = BertModel.from_pretrained(model_name, output_attentions=True, output_hidden_states=True) + self.assertEqual(model.config.output_attentions, True) + self.assertEqual(model.config.output_hidden_states, True) + self.assertEqual(model.config, config) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_distilbert_test.py b/Optimus/code/pytorch_transformers/tests/modeling_distilbert_test.py new file mode 100755 index 0000000000000000000000000000000000000000..0d9f2311777c55b80b030ddfb8df89f7e412e10d --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_distilbert_test.py @@ -0,0 +1,215 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
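+# DistilBERT test suite, following the same pattern as the BERT tests: a
+# DistilBertModelTester builds random ids plus a small DistilBertConfig, and the
+# create_and_check_* helpers verify output shapes and losses for the base model and
+# the masked-LM, question-answering and sequence-classification heads.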
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest + +from pytorch_transformers import (DistilBertConfig, DistilBertModel, DistilBertForMaskedLM, + DistilBertForQuestionAnswering, DistilBertForSequenceClassification) + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class DistilBertModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (DistilBertModel, DistilBertForMaskedLM, DistilBertForQuestionAnswering, + DistilBertForSequenceClassification) + test_pruning = True + test_torchscript = True + test_resize_embeddings = True + test_head_masking = True + + class DistilBertModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_mask=True, + use_token_type_ids=False, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = DistilBertConfig( + vocab_size_or_config_json_file=self.vocab_size, + dim=self.hidden_size, + n_layers=self.num_hidden_layers, + n_heads=self.num_attention_heads, + hidden_dim=self.intermediate_size, + hidden_act=self.hidden_act, + dropout=self.hidden_dropout_prob, + attention_dropout=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + initializer_range=self.initializer_range) + + return config, input_ids, input_mask, sequence_labels, token_labels, choice_labels + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_distilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = DistilBertModel(config=config) + model.eval() + (sequence_output,) = 
model(input_ids, input_mask) + (sequence_output,) = model(input_ids) + + result = { + "sequence_output": sequence_output, + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + + def create_and_check_distilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = DistilBertForMaskedLM(config=config) + model.eval() + loss, prediction_scores = model(input_ids, attention_mask=input_mask, masked_lm_labels=token_labels) + result = { + "loss": loss, + "prediction_scores": prediction_scores, + } + self.parent.assertListEqual( + list(result["prediction_scores"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.check_loss_output(result) + + def create_and_check_distilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels): + model = DistilBertForQuestionAnswering(config=config) + model.eval() + loss, start_logits, end_logits = model(input_ids, attention_mask=input_mask, start_positions=sequence_labels, end_positions=sequence_labels) + result = { + "loss": loss, + "start_logits": start_logits, + "end_logits": end_logits, + } + self.parent.assertListEqual( + list(result["start_logits"].size()), + [self.batch_size, self.seq_length]) + self.parent.assertListEqual( + list(result["end_logits"].size()), + [self.batch_size, self.seq_length]) + self.check_loss_output(result) + + def create_and_check_distilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels): + config.num_labels = self.num_labels + model = DistilBertForSequenceClassification(config) + model.eval() + loss, logits = model(input_ids, attention_mask=input_mask, labels=sequence_labels) + result = { + "loss": loss, + "logits": logits, + } + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.num_labels]) + self.check_loss_output(result) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, input_mask, sequence_labels, token_labels, choice_labels) = config_and_inputs + inputs_dict = {'input_ids': input_ids, 'attention_mask': input_mask} + return config, inputs_dict + + def setUp(self): + self.model_tester = DistilBertModelTest.DistilBertModelTester(self) + self.config_tester = ConfigTester(self, config_class=DistilBertConfig, dim=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_distilbert_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_distilbert_model(*config_and_inputs) + + def test_for_masked_lm(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_distilbert_for_masked_lm(*config_and_inputs) + + def test_for_question_answering(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_distilbert_for_question_answering(*config_and_inputs) + + def test_for_sequence_classification(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_distilbert_for_sequence_classification(*config_and_inputs) + + # @pytest.mark.slow + # def test_model_from_pretrained(self): + # cache_dir = "/tmp/pytorch_transformers_test/" + # for model_name in list(DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + # model = 
DistilBertModel.from_pretrained(model_name, cache_dir=cache_dir) + # shutil.rmtree(cache_dir) + # self.assertIsNotNone(model) + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_gpt2_test.py b/Optimus/code/pytorch_transformers/tests/modeling_gpt2_test.py new file mode 100755 index 0000000000000000000000000000000000000000..2717805120eaffc31a5536ce2656f12d6f4a8435 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_gpt2_test.py @@ -0,0 +1,214 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import pytest +import shutil + + +from pytorch_transformers import (GPT2Config, GPT2Model, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, + GPT2LMHeadModel, GPT2DoubleHeadsModel) + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class GPT2ModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (GPT2Model, GPT2LMHeadModel, GPT2DoubleHeadsModel) + + class GPT2ModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, 
self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = GPT2Config( + vocab_size_or_config_json_file=self.vocab_size, + n_embd=self.hidden_size, + n_layer=self.num_hidden_layers, + n_head=self.num_attention_heads, + # intermediate_size=self.intermediate_size, + # hidden_act=self.hidden_act, + # hidden_dropout_prob=self.hidden_dropout_prob, + # attention_probs_dropout_prob=self.attention_probs_dropout_prob, + n_positions=self.max_position_embeddings, + n_ctx=self.max_position_embeddings + # type_vocab_size=self.type_vocab_size, + # initializer_range=self.initializer_range + ) + + head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2) + + return config, input_ids, head_mask, token_type_ids, sequence_labels, token_labels, choice_labels + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_gpt2_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = GPT2Model(config=config) + model.eval() + + model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask) + model(input_ids, token_type_ids=token_type_ids) + sequence_output, presents = model(input_ids) + + result = { + "sequence_output": sequence_output, + "presents": presents, + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertEqual(len(result["presents"]), config.n_layer) + + def create_and_check_lm_head_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = GPT2LMHeadModel(config) + model.eval() + + loss, lm_logits, _ = model(input_ids, token_type_ids=token_type_ids, labels=input_ids) + + result = { + "loss": loss, + "lm_logits": lm_logits + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["lm_logits"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + + def create_and_check_double_lm_head_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = GPT2DoubleHeadsModel(config) + model.eval() + + loss, lm_logits, mc_logits, _ = model(input_ids, token_type_ids=token_type_ids, lm_labels=input_ids) + + result = { + "loss": loss, + "lm_logits": lm_logits + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["lm_logits"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, head_mask, token_type_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs + inputs_dict = { + 'input_ids': input_ids, + 'token_type_ids': token_type_ids, + 'head_mask': head_mask + } + + return config, inputs_dict + + def setUp(self): + self.model_tester = GPT2ModelTest.GPT2ModelTester(self) + self.config_tester = ConfigTester(self, config_class=GPT2Config, n_embd=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_gpt2_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_gpt2_model(*config_and_inputs) + + def test_gpt2_lm_head_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_lm_head_model(*config_and_inputs) + + def test_gpt2_double_lm_head_model(self): + config_and_inputs = 
self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_double_lm_head_model(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = GPT2Model.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_openai_test.py b/Optimus/code/pytorch_transformers/tests/modeling_openai_test.py new file mode 100755 index 0000000000000000000000000000000000000000..dbef6c52eb8e1caadfdf3c6dcd1d825615197346 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_openai_test.py @@ -0,0 +1,212 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import pytest +import shutil + + +from pytorch_transformers import (OpenAIGPTConfig, OpenAIGPTModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, + OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel) + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class OpenAIGPTModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (OpenAIGPTModel, OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel) + + class OpenAIGPTModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + 
token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = OpenAIGPTConfig( + vocab_size_or_config_json_file=self.vocab_size, + n_embd=self.hidden_size, + n_layer=self.num_hidden_layers, + n_head=self.num_attention_heads, + # intermediate_size=self.intermediate_size, + # hidden_act=self.hidden_act, + # hidden_dropout_prob=self.hidden_dropout_prob, + # attention_probs_dropout_prob=self.attention_probs_dropout_prob, + n_positions=self.max_position_embeddings, + n_ctx=self.max_position_embeddings + # type_vocab_size=self.type_vocab_size, + # initializer_range=self.initializer_range + ) + + head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2) + + return config, input_ids, head_mask, token_type_ids, sequence_labels, token_labels, choice_labels + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_openai_gpt_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = OpenAIGPTModel(config=config) + model.eval() + + model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask) + model(input_ids, token_type_ids=token_type_ids) + (sequence_output,) = model(input_ids) + + result = { + "sequence_output": sequence_output + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + + def create_and_check_lm_head_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = OpenAIGPTLMHeadModel(config) + model.eval() + + loss, lm_logits = model(input_ids, token_type_ids=token_type_ids, labels=input_ids) + + result = { + "loss": loss, + "lm_logits": lm_logits + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["lm_logits"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + + def create_and_check_double_lm_head_model(self, config, input_ids, head_mask, token_type_ids, *args): + model = OpenAIGPTDoubleHeadsModel(config) + model.eval() + + loss, lm_logits, mc_logits = model(input_ids, token_type_ids=token_type_ids, lm_labels=input_ids) + + result = { + "loss": loss, + "lm_logits": lm_logits + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["lm_logits"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, head_mask, token_type_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs + inputs_dict = { + 'input_ids': input_ids, + 'token_type_ids': token_type_ids, + 'head_mask': head_mask + } + + return config, inputs_dict + + def setUp(self): + self.model_tester = OpenAIGPTModelTest.OpenAIGPTModelTester(self) + self.config_tester = ConfigTester(self, config_class=OpenAIGPTConfig, n_embd=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_openai_gpt_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + 
self.model_tester.create_and_check_openai_gpt_model(*config_and_inputs) + + def test_openai_gpt_lm_head_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_lm_head_model(*config_and_inputs) + + def test_openai_gpt_double_lm_head_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_double_lm_head_model(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = OpenAIGPTModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_roberta_test.py b/Optimus/code/pytorch_transformers/tests/modeling_roberta_test.py new file mode 100755 index 0000000000000000000000000000000000000000..69981af22275ae58b39380c69c4581b2eb2924a3 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_roberta_test.py @@ -0,0 +1,243 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
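+# Tests for the RoBERTa models: shape checks on a small random config, plus slow integration tests against pretrained checkpoints.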
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import shutil +import pytest +import torch + +from pytorch_transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification) +from pytorch_transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class RobertaModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (RobertaForMaskedLM, RobertaModel) + + class RobertaModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_mask=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + + token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + + sequence_labels = None + token_labels = None + choice_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + choice_labels = ids_tensor([self.batch_size], self.num_choices) + + config = RobertaConfig( + vocab_size_or_config_json_file=self.vocab_size, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + type_vocab_size=self.type_vocab_size, + initializer_range=self.initializer_range) + + return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_roberta_model(self, config, 
input_ids, token_type_ids, input_mask, sequence_labels, + token_labels, choice_labels): + model = RobertaModel(config=config) + model.eval() + sequence_output, pooled_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids) + sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids) + sequence_output, pooled_output = model(input_ids) + + result = { + "sequence_output": sequence_output, + "pooled_output": pooled_output, + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size]) + + def create_and_check_roberta_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, + token_labels, choice_labels): + model = RobertaForMaskedLM(config=config) + model.eval() + loss, prediction_scores = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels) + result = { + "loss": loss, + "prediction_scores": prediction_scores, + } + self.parent.assertListEqual( + list(result["prediction_scores"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.check_loss_output(result) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, token_type_ids, input_mask, + sequence_labels, token_labels, choice_labels) = config_and_inputs + inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask} + return config, inputs_dict + + def setUp(self): + self.model_tester = RobertaModelTest.RobertaModelTester(self) + self.config_tester = ConfigTester(self, config_class=RobertaConfig, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_roberta_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_roberta_model(*config_and_inputs) + + def test_for_masked_lm(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_roberta_for_masked_lm(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + + +class RobertaModelIntegrationTest(unittest.TestCase): + + @pytest.mark.slow + def test_inference_masked_lm(self): + model = RobertaForMaskedLM.from_pretrained('roberta-base') + + input_ids = torch.tensor([[ 0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]]) + output = model(input_ids)[0] + expected_shape = torch.Size((1, 11, 50265)) + self.assertEqual( + output.shape, + expected_shape + ) + # compare the actual values for a slice. + expected_slice = torch.Tensor( + [[[33.8843, -4.3107, 22.7779], + [ 4.6533, -2.8099, 13.6252], + [ 1.8222, -3.6898, 8.8600]]] + ) + self.assertTrue( + torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3) + ) + + @pytest.mark.slow + def test_inference_no_head(self): + model = RobertaModel.from_pretrained('roberta-base') + + input_ids = torch.tensor([[ 0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]]) + output = model(input_ids)[0] + # compare the actual values for a slice. 
+ expected_slice = torch.Tensor( + [[[-0.0231, 0.0782, 0.0074], + [-0.1854, 0.0539, -0.0174], + [ 0.0548, 0.0799, 0.1687]]] + ) + self.assertTrue( + torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3) + ) + + @pytest.mark.slow + def test_inference_classification_head(self): + model = RobertaForSequenceClassification.from_pretrained('roberta-large-mnli') + + input_ids = torch.tensor([[ 0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]]) + output = model(input_ids)[0] + expected_shape = torch.Size((1, 3)) + self.assertEqual( + output.shape, + expected_shape + ) + expected_tensor = torch.Tensor([[-0.9469, 0.3913, 0.5118]]) + self.assertTrue( + torch.allclose(output, expected_tensor, atol=1e-3) + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_transfo_xl_test.py b/Optimus/code/pytorch_transformers/tests/modeling_transfo_xl_test.py new file mode 100755 index 0000000000000000000000000000000000000000..f482c47202245c6cfc92acef6000c720b05350fc --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_transfo_xl_test.py @@ -0,0 +1,213 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import random +import shutil +import pytest + +import torch + +from pytorch_transformers import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel) +from pytorch_transformers.modeling_transfo_xl import TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + +class TransfoXLModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (TransfoXLModel, TransfoXLLMHeadModel) + test_pruning = False + test_torchscript = False + test_resize_embeddings = False + + class TransfoXLModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + mem_len=30, + clamp_len=15, + is_training=True, + use_labels=True, + vocab_size=99, + cutoffs=[10, 50, 80], + hidden_size=32, + d_embed=32, + num_attention_heads=4, + d_head=8, + d_inner=128, + div_val=2, + num_hidden_layers=5, + scope=None, + seed=1, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.mem_len = mem_len + self.key_len = seq_length + mem_len + self.clamp_len = clamp_len + self.is_training = is_training + self.use_labels = use_labels + self.vocab_size = vocab_size + self.cutoffs = cutoffs + self.hidden_size = hidden_size + self.d_embed = d_embed + self.num_attention_heads = num_attention_heads + self.d_head = d_head + self.d_inner = d_inner + self.div_val = div_val + self.num_hidden_layers = num_hidden_layers + self.scope = scope + self.seed = seed + + def prepare_config_and_inputs(self): + input_ids_1 = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + input_ids_2 = 
ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + lm_labels = None + if self.use_labels: + lm_labels = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + config = TransfoXLConfig( + vocab_size_or_config_json_file=self.vocab_size, + mem_len=self.mem_len, + clamp_len=self.clamp_len, + cutoffs=self.cutoffs, + d_model=self.hidden_size, + d_embed=self.d_embed, + n_head=self.num_attention_heads, + d_head=self.d_head, + d_inner=self.d_inner, + div_val=self.div_val, + n_layer=self.num_hidden_layers) + + return (config, input_ids_1, input_ids_2, lm_labels) + + def set_seed(self): + random.seed(self.seed) + torch.manual_seed(self.seed) + + def create_transfo_xl_model(self, config, input_ids_1, input_ids_2, lm_labels): + model = TransfoXLModel(config) + model.eval() + + hidden_states_1, mems_1 = model(input_ids_1) + hidden_states_2, mems_2 = model(input_ids_2, mems_1) + outputs = { + "hidden_states_1": hidden_states_1, + "mems_1": mems_1, + "hidden_states_2": hidden_states_2, + "mems_2": mems_2, + } + return outputs + + def check_transfo_xl_model_output(self, result): + self.parent.assertListEqual( + list(result["hidden_states_1"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertListEqual( + list(result["hidden_states_2"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_1"]), + [[self.mem_len, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_2"]), + [[self.mem_len, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + + def create_transfo_xl_lm_head(self, config, input_ids_1, input_ids_2, lm_labels): + model = TransfoXLLMHeadModel(config) + model.eval() + + lm_logits_1, mems_1 = model(input_ids_1) + loss_1, _, mems_1 = model(input_ids_1, labels=lm_labels) + lm_logits_2, mems_2 = model(input_ids_2, mems=mems_1) + loss_2, _, mems_2 = model(input_ids_2, labels=lm_labels, mems=mems_1) + + outputs = { + "loss_1": loss_1, + "mems_1": mems_1, + "lm_logits_1": lm_logits_1, + "loss_2": loss_2, + "mems_2": mems_2, + "lm_logits_2": lm_logits_2, + } + return outputs + + def check_transfo_xl_lm_head_output(self, result): + self.parent.assertListEqual( + list(result["loss_1"].size()), + [self.batch_size, self.seq_length]) + self.parent.assertListEqual( + list(result["lm_logits_1"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_1"]), + [[self.mem_len, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + self.parent.assertListEqual( + list(result["loss_2"].size()), + [self.batch_size, self.seq_length]) + self.parent.assertListEqual( + list(result["lm_logits_2"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_2"]), + [[self.mem_len, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids_1, input_ids_2, lm_labels) = config_and_inputs + inputs_dict = {'input_ids': input_ids_1} + return config, inputs_dict + + + def setUp(self): + self.model_tester = TransfoXLModelTest.TransfoXLModelTester(self) + self.config_tester = ConfigTester(self, config_class=TransfoXLConfig, d_embed=37) + + def test_config(self): + 
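+        # Run the shared ConfigTester checks for TransfoXLConfig.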
self.config_tester.run_common_tests() + + def test_transfo_xl_model(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + output_result = self.model_tester.create_transfo_xl_model(*config_and_inputs) + self.model_tester.check_transfo_xl_model_output(output_result) + + def test_transfo_xl_lm_head(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + output_result = self.model_tester.create_transfo_xl_lm_head(*config_and_inputs) + self.model_tester.check_transfo_xl_lm_head_output(output_result) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = TransfoXLModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_xlm_test.py b/Optimus/code/pytorch_transformers/tests/modeling_xlm_test.py new file mode 100755 index 0000000000000000000000000000000000000000..dcd09634770be8986d51de85c546e5bc555045de --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_xlm_test.py @@ -0,0 +1,294 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
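+# Tests for the XLM models: base model, LM head, question answering, and sequence classification heads.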
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import shutil +import pytest + +from pytorch_transformers import (XLMConfig, XLMModel, XLMWithLMHeadModel, XLMForQuestionAnswering, XLMForSequenceClassification) +from pytorch_transformers.modeling_xlm import XLM_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + + +class XLMModelTest(CommonTestCases.CommonModelTester): + + all_model_classes = (XLMModel, XLMWithLMHeadModel, + XLMForQuestionAnswering, XLMForSequenceClassification) + # , XLMForSequenceClassification, XLMForTokenClassification), + + class XLMModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_lengths=True, + use_token_type_ids=True, + use_labels=True, + gelu_activation=True, + sinusoidal_embeddings=False, + causal=False, + asm=False, + n_langs=2, + vocab_size=99, + n_special=0, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + type_sequence_label_size=2, + initializer_range=0.02, + num_labels=3, + num_choices=4, + summary_type="last", + use_proj=True, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_lengths = use_input_lengths + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.gelu_activation = gelu_activation + self.sinusoidal_embeddings = sinusoidal_embeddings + self.asm = asm + self.n_langs = n_langs + self.vocab_size = vocab_size + self.n_special = n_special + self.summary_type = summary_type + self.causal = causal + self.use_proj = use_proj + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.n_langs = n_langs + self.type_sequence_label_size = type_sequence_label_size + self.initializer_range = initializer_range + self.summary_type = summary_type + self.num_labels = num_labels + self.num_choices = num_choices + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + input_mask = ids_tensor([self.batch_size, self.seq_length], 2).float() + + input_lengths = None + if self.use_input_lengths: + input_lengths = ids_tensor([self.batch_size], vocab_size=2) + self.seq_length - 2 # small variation of seq_length + + token_type_ids = None + if self.use_token_type_ids: + token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.n_langs) + + sequence_labels = None + token_labels = None + is_impossible_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + is_impossible_labels = ids_tensor([self.batch_size], 2).float() + + config = XLMConfig( + vocab_size_or_config_json_file=self.vocab_size, + n_special=self.n_special, + emb_dim=self.hidden_size, + n_layers=self.num_hidden_layers, + n_heads=self.num_attention_heads, + dropout=self.hidden_dropout_prob, + 
attention_dropout=self.attention_probs_dropout_prob, + gelu_activation=self.gelu_activation, + sinusoidal_embeddings=self.sinusoidal_embeddings, + asm=self.asm, + causal=self.causal, + n_langs=self.n_langs, + max_position_embeddings=self.max_position_embeddings, + initializer_range=self.initializer_range, + summary_type=self.summary_type, + use_proj=self.use_proj) + + return config, input_ids, token_type_ids, input_lengths, sequence_labels, token_labels, is_impossible_labels, input_mask + + def check_loss_output(self, result): + self.parent.assertListEqual( + list(result["loss"].size()), + []) + + def create_and_check_xlm_model(self, config, input_ids, token_type_ids, input_lengths, sequence_labels, token_labels, is_impossible_labels, input_mask): + model = XLMModel(config=config) + model.eval() + outputs = model(input_ids, lengths=input_lengths, langs=token_type_ids) + outputs = model(input_ids, langs=token_type_ids) + outputs = model(input_ids) + sequence_output = outputs[0] + result = { + "sequence_output": sequence_output, + } + self.parent.assertListEqual( + list(result["sequence_output"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + + + def create_and_check_xlm_lm_head(self, config, input_ids, token_type_ids, input_lengths, sequence_labels, token_labels, is_impossible_labels, input_mask): + model = XLMWithLMHeadModel(config) + model.eval() + + loss, logits = model(input_ids, token_type_ids=token_type_ids, labels=token_labels) + + result = { + "loss": loss, + "logits": logits, + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + + + def create_and_check_xlm_qa(self, config, input_ids, token_type_ids, input_lengths, sequence_labels, token_labels, is_impossible_labels, input_mask): + model = XLMForQuestionAnswering(config) + model.eval() + + outputs = model(input_ids) + start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits, mems = outputs + + outputs = model(input_ids, start_positions=sequence_labels, + end_positions=sequence_labels, + cls_index=sequence_labels, + is_impossible=is_impossible_labels, + p_mask=input_mask) + + outputs = model(input_ids, start_positions=sequence_labels, + end_positions=sequence_labels, + cls_index=sequence_labels, + is_impossible=is_impossible_labels) + + (total_loss,) = outputs + + outputs = model(input_ids, start_positions=sequence_labels, + end_positions=sequence_labels) + + (total_loss,) = outputs + + result = { + "loss": total_loss, + "start_top_log_probs": start_top_log_probs, + "start_top_index": start_top_index, + "end_top_log_probs": end_top_log_probs, + "end_top_index": end_top_index, + "cls_logits": cls_logits, + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["start_top_log_probs"].size()), + [self.batch_size, model.config.start_n_top]) + self.parent.assertListEqual( + list(result["start_top_index"].size()), + [self.batch_size, model.config.start_n_top]) + self.parent.assertListEqual( + list(result["end_top_log_probs"].size()), + [self.batch_size, model.config.start_n_top * model.config.end_n_top]) + self.parent.assertListEqual( + list(result["end_top_index"].size()), + [self.batch_size, model.config.start_n_top * model.config.end_n_top]) + self.parent.assertListEqual( + list(result["cls_logits"].size()), + [self.batch_size]) + + + def create_and_check_xlm_sequence_classif(self, config, 
input_ids, token_type_ids, input_lengths, sequence_labels, token_labels, is_impossible_labels, input_mask): + model = XLMForSequenceClassification(config) + model.eval() + + (logits,) = model(input_ids) + loss, logits = model(input_ids, labels=sequence_labels) + + result = { + "loss": loss, + "logits": logits, + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.type_sequence_label_size]) + + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids, token_type_ids, input_lengths, + sequence_labels, token_labels, is_impossible_labels, input_mask) = config_and_inputs + inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'lengths': input_lengths} + return config, inputs_dict + + def setUp(self): + self.model_tester = XLMModelTest.XLMModelTester(self) + self.config_tester = ConfigTester(self, config_class=XLMConfig, emb_dim=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_xlm_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_xlm_model(*config_and_inputs) + + # config_and_inputs = tester.prepare_config_and_inputs() + # tester.create_and_check_xlm_for_masked_lm(*config_and_inputs) + + # config_and_inputs = tester.prepare_config_and_inputs() + # tester.create_and_check_xlm_for_multiple_choice(*config_and_inputs) + + # config_and_inputs = tester.prepare_config_and_inputs() + # tester.create_and_check_xlm_for_question_answering(*config_and_inputs) + + # config_and_inputs = tester.prepare_config_and_inputs() + # tester.create_and_check_xlm_for_sequence_classification(*config_and_inputs) + + # config_and_inputs = tester.prepare_config_and_inputs() + # tester.create_and_check_xlm_for_token_classification(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(XLM_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = XLMModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/modeling_xlnet_test.py b/Optimus/code/pytorch_transformers/tests/modeling_xlnet_test.py new file mode 100755 index 0000000000000000000000000000000000000000..4445bc17ac4e69bad8e65485d91b56248ec198af --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/modeling_xlnet_test.py @@ -0,0 +1,323 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
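+# Tests for the XLNet models: base model, LM head, sequence classification, and question answering heads.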
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import unittest +import json +import random +import shutil +import pytest + +import torch + +from pytorch_transformers import (XLNetConfig, XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering) +from pytorch_transformers.modeling_xlnet import XLNET_PRETRAINED_MODEL_ARCHIVE_MAP + +from .modeling_common_test import (CommonTestCases, ids_tensor) +from .configuration_common_test import ConfigTester + +class XLNetModelTest(CommonTestCases.CommonModelTester): + + all_model_classes=(XLNetModel, XLNetLMHeadModel, + XLNetForSequenceClassification, XLNetForQuestionAnswering) + test_pruning = False + + class XLNetModelTester(object): + + def __init__(self, + parent, + batch_size=13, + seq_length=7, + mem_len=10, + clamp_len=-1, + reuse_len=15, + is_training=True, + use_labels=True, + vocab_size=99, + cutoffs=[10, 50, 80], + hidden_size=32, + num_attention_heads=4, + d_inner=128, + num_hidden_layers=5, + max_position_embeddings=10, + type_sequence_label_size=2, + untie_r=True, + bi_data=False, + same_length=False, + initializer_range=0.05, + seed=1, + type_vocab_size=2, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.mem_len = mem_len + # self.key_len = seq_length + mem_len + self.clamp_len = clamp_len + self.reuse_len = reuse_len + self.is_training = is_training + self.use_labels = use_labels + self.vocab_size = vocab_size + self.cutoffs = cutoffs + self.hidden_size = hidden_size + self.num_attention_heads = num_attention_heads + self.d_inner = d_inner + self.num_hidden_layers = num_hidden_layers + self.max_position_embeddings = max_position_embeddings + self.bi_data = bi_data + self.untie_r = untie_r + self.same_length = same_length + self.initializer_range = initializer_range + self.seed = seed + self.type_vocab_size = type_vocab_size + self.type_sequence_label_size = type_sequence_label_size + + def prepare_config_and_inputs(self): + input_ids_1 = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + input_ids_2 = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + segment_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) + input_mask = ids_tensor([self.batch_size, self.seq_length], 2).float() + + input_ids_q = ids_tensor([self.batch_size, self.seq_length + 1], self.vocab_size) + perm_mask = torch.zeros(self.batch_size, self.seq_length + 1, self.seq_length + 1, dtype=torch.float) + perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token + target_mapping = torch.zeros(self.batch_size, 1, self.seq_length + 1, dtype=torch.float) + target_mapping[:, 0, -1] = 1.0 # predict last token + + sequence_labels = None + lm_labels = None + is_impossible_labels = None + if self.use_labels: + lm_labels = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + is_impossible_labels = ids_tensor([self.batch_size], 2).float() + + config = XLNetConfig( + vocab_size_or_config_json_file=self.vocab_size, + d_model=self.hidden_size, + n_head=self.num_attention_heads, + d_inner=self.d_inner, + n_layer=self.num_hidden_layers, + untie_r=self.untie_r, + max_position_embeddings=self.max_position_embeddings, + mem_len=self.mem_len, + clamp_len=self.clamp_len, + same_length=self.same_length, + reuse_len=self.reuse_len, + bi_data=self.bi_data, + 
initializer_range=self.initializer_range, + num_labels=self.type_sequence_label_size) + + return (config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, sequence_labels, is_impossible_labels) + + def set_seed(self): + random.seed(self.seed) + torch.manual_seed(self.seed) + + def create_and_check_xlnet_base_model(self, config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, sequence_labels, is_impossible_labels): + model = XLNetModel(config) + model.eval() + + _, _ = model(input_ids_1, input_mask=input_mask) + _, _ = model(input_ids_1, attention_mask=input_mask) + _, _ = model(input_ids_1, token_type_ids=segment_ids) + outputs, mems_1 = model(input_ids_1) + + result = { + "mems_1": mems_1, + "outputs": outputs, + } + + self.parent.assertListEqual( + list(result["outputs"].size()), + [self.batch_size, self.seq_length, self.hidden_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_1"]), + [[self.seq_length, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + def create_and_check_xlnet_lm_head(self, config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, sequence_labels, is_impossible_labels): + model = XLNetLMHeadModel(config) + model.eval() + + loss_1, all_logits_1, mems_1 = model(input_ids_1, token_type_ids=segment_ids, labels=lm_labels) + + loss_2, all_logits_2, mems_2 = model(input_ids_2, token_type_ids=segment_ids, labels=lm_labels, mems=mems_1) + + logits, _ = model(input_ids_q, perm_mask=perm_mask, target_mapping=target_mapping) + + result = { + "loss_1": loss_1, + "mems_1": mems_1, + "all_logits_1": all_logits_1, + "loss_2": loss_2, + "mems_2": mems_2, + "all_logits_2": all_logits_2, + } + + self.parent.assertListEqual( + list(result["loss_1"].size()), + []) + self.parent.assertListEqual( + list(result["all_logits_1"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_1"]), + [[self.seq_length, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + self.parent.assertListEqual( + list(result["loss_2"].size()), + []) + self.parent.assertListEqual( + list(result["all_logits_2"].size()), + [self.batch_size, self.seq_length, self.vocab_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_2"]), + [[self.mem_len, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + def create_and_check_xlnet_qa(self, config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, sequence_labels, is_impossible_labels): + model = XLNetForQuestionAnswering(config) + model.eval() + + outputs = model(input_ids_1) + start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits, mems = outputs + + outputs = model(input_ids_1, start_positions=sequence_labels, + end_positions=sequence_labels, + cls_index=sequence_labels, + is_impossible=is_impossible_labels, + p_mask=input_mask) + + outputs = model(input_ids_1, start_positions=sequence_labels, + end_positions=sequence_labels, + cls_index=sequence_labels, + is_impossible=is_impossible_labels) + + total_loss, mems = outputs + + outputs = model(input_ids_1, start_positions=sequence_labels, + end_positions=sequence_labels) + + total_loss, mems = outputs + + result = { + "loss": total_loss, + "start_top_log_probs": start_top_log_probs, + 
"start_top_index": start_top_index, + "end_top_log_probs": end_top_log_probs, + "end_top_index": end_top_index, + "cls_logits": cls_logits, + "mems": mems, + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["start_top_log_probs"].size()), + [self.batch_size, model.config.start_n_top]) + self.parent.assertListEqual( + list(result["start_top_index"].size()), + [self.batch_size, model.config.start_n_top]) + self.parent.assertListEqual( + list(result["end_top_log_probs"].size()), + [self.batch_size, model.config.start_n_top * model.config.end_n_top]) + self.parent.assertListEqual( + list(result["end_top_index"].size()), + [self.batch_size, model.config.start_n_top * model.config.end_n_top]) + self.parent.assertListEqual( + list(result["cls_logits"].size()), + [self.batch_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems"]), + [[self.seq_length, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + def create_and_check_xlnet_sequence_classif(self, config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, sequence_labels, is_impossible_labels): + model = XLNetForSequenceClassification(config) + model.eval() + + logits, mems_1 = model(input_ids_1) + loss, logits, mems_1 = model(input_ids_1, labels=sequence_labels) + + result = { + "loss": loss, + "mems_1": mems_1, + "logits": logits, + } + + self.parent.assertListEqual( + list(result["loss"].size()), + []) + self.parent.assertListEqual( + list(result["logits"].size()), + [self.batch_size, self.type_sequence_label_size]) + self.parent.assertListEqual( + list(list(mem.size()) for mem in result["mems_1"]), + [[self.seq_length, self.batch_size, self.hidden_size]] * self.num_hidden_layers) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + (config, input_ids_1, input_ids_2, input_ids_q, perm_mask, input_mask, + target_mapping, segment_ids, lm_labels, + sequence_labels, is_impossible_labels) = config_and_inputs + inputs_dict = {'input_ids': input_ids_1} + return config, inputs_dict + + + def setUp(self): + self.model_tester = XLNetModelTest.XLNetModelTester(self) + self.config_tester = ConfigTester(self, config_class=XLNetConfig, d_inner=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_xlnet_base_model(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_xlnet_base_model(*config_and_inputs) + + def test_xlnet_lm_head(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_xlnet_lm_head(*config_and_inputs) + + def test_xlnet_sequence_classif(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_xlnet_sequence_classif(*config_and_inputs) + + def test_xlnet_qa(self): + self.model_tester.set_seed() + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_xlnet_qa(*config_and_inputs) + + @pytest.mark.slow + def test_model_from_pretrained(self): + cache_dir = "/tmp/pytorch_transformers_test/" + for model_name in list(XLNET_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + model = XLNetModel.from_pretrained(model_name, cache_dir=cache_dir) + shutil.rmtree(cache_dir) + self.assertIsNotNone(model) + + 
+if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/optimization_test.py b/Optimus/code/pytorch_transformers/tests/optimization_test.py new file mode 100755 index 0000000000000000000000000000000000000000..014654158274df4b81d9c441851e15c8d54eb6c2 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/optimization_test.py @@ -0,0 +1,139 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import os + +import torch + +from pytorch_transformers import (AdamW, ConstantLRSchedule, WarmupConstantSchedule, + WarmupCosineSchedule, WarmupCosineWithHardRestartsSchedule, WarmupLinearSchedule) + +from .tokenization_tests_commons import TemporaryDirectory + + +def unwrap_schedule(scheduler, num_steps=10): + lrs = [] + for _ in range(num_steps): + scheduler.step() + lrs.append(scheduler.get_lr()) + return lrs + +def unwrap_and_save_reload_schedule(scheduler, num_steps=10): + lrs = [] + for step in range(num_steps): + scheduler.step() + lrs.append(scheduler.get_lr()) + if step == num_steps // 2: + with TemporaryDirectory() as tmpdirname: + file_name = os.path.join(tmpdirname, 'schedule.bin') + torch.save(scheduler.state_dict(), file_name) + + state_dict = torch.load(file_name) + scheduler.load_state_dict(state_dict) + return lrs + +class OptimizationTest(unittest.TestCase): + + def assertListAlmostEqual(self, list1, list2, tol): + self.assertEqual(len(list1), len(list2)) + for a, b in zip(list1, list2): + self.assertAlmostEqual(a, b, delta=tol) + + def test_adam_w(self): + w = torch.tensor([0.1, -0.2, -0.1], requires_grad=True) + target = torch.tensor([0.4, 0.2, -0.5]) + criterion = torch.nn.MSELoss() + # No warmup, constant schedule, no gradient clipping + optimizer = AdamW(params=[w], lr=2e-1, weight_decay=0.0) + for _ in range(100): + loss = criterion(w, target) + loss.backward() + optimizer.step() + w.grad.detach_() # No zero_grad() function on simple tensors. we do it ourselves. + w.grad.zero_() + self.assertListAlmostEqual(w.tolist(), [0.4, 0.2, -0.5], tol=1e-2) + + +class ScheduleInitTest(unittest.TestCase): + m = torch.nn.Linear(50, 50) + optimizer = AdamW(m.parameters(), lr=10.) + num_steps = 10 + + def assertListAlmostEqual(self, list1, list2, tol): + self.assertEqual(len(list1), len(list2)) + for a, b in zip(list1, list2): + self.assertAlmostEqual(a, b, delta=tol) + + def test_constant_scheduler(self): + scheduler = ConstantLRSchedule(self.optimizer) + lrs = unwrap_schedule(scheduler, self.num_steps) + expected_learning_rates = [10.] 
* self.num_steps + self.assertEqual(len(lrs[0]), 1) + self.assertListEqual([l[0] for l in lrs], expected_learning_rates) + + scheduler = ConstantLRSchedule(self.optimizer) + lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps) + self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2]) + + def test_warmup_constant_scheduler(self): + scheduler = WarmupConstantSchedule(self.optimizer, warmup_steps=4) + lrs = unwrap_schedule(scheduler, self.num_steps) + expected_learning_rates = [2.5, 5.0, 7.5, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0] + self.assertEqual(len(lrs[0]), 1) + self.assertListEqual([l[0] for l in lrs], expected_learning_rates) + + scheduler = WarmupConstantSchedule(self.optimizer, warmup_steps=4) + lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps) + self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2]) + + def test_warmup_linear_scheduler(self): + scheduler = WarmupLinearSchedule(self.optimizer, warmup_steps=2, t_total=10) + lrs = unwrap_schedule(scheduler, self.num_steps) + expected_learning_rates = [5.0, 10.0, 8.75, 7.5, 6.25, 5.0, 3.75, 2.5, 1.25, 0.0] + self.assertEqual(len(lrs[0]), 1) + self.assertListEqual([l[0] for l in lrs], expected_learning_rates) + + scheduler = WarmupLinearSchedule(self.optimizer, warmup_steps=2, t_total=10) + lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps) + self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2]) + + def test_warmup_cosine_scheduler(self): + scheduler = WarmupCosineSchedule(self.optimizer, warmup_steps=2, t_total=10) + lrs = unwrap_schedule(scheduler, self.num_steps) + expected_learning_rates = [5.0, 10.0, 9.61, 8.53, 6.91, 5.0, 3.08, 1.46, 0.38, 0.0] + self.assertEqual(len(lrs[0]), 1) + self.assertListAlmostEqual([l[0] for l in lrs], expected_learning_rates, tol=1e-2) + + scheduler = WarmupCosineSchedule(self.optimizer, warmup_steps=2, t_total=10) + lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps) + self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2]) + + def test_warmup_cosine_hard_restart_scheduler(self): + scheduler = WarmupCosineWithHardRestartsSchedule(self.optimizer, warmup_steps=2, cycles=2, t_total=10) + lrs = unwrap_schedule(scheduler, self.num_steps) + expected_learning_rates = [5.0, 10.0, 8.53, 5.0, 1.46, 10.0, 8.53, 5.0, 1.46, 0.0] + self.assertEqual(len(lrs[0]), 1) + self.assertListAlmostEqual([l[0] for l in lrs], expected_learning_rates, tol=1e-2) + + scheduler = WarmupCosineWithHardRestartsSchedule(self.optimizer, warmup_steps=2, cycles=2, t_total=10) + lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps) + self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2]) + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_auto_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_auto_test.py new file mode 100755 index 0000000000000000000000000000000000000000..f4f82083f21051b5d25ec0694ffeef183f538ad8 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_auto_test.py @@ -0,0 +1,46 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import shutil +import pytest +import logging + +from pytorch_transformers import AutoTokenizer, BertTokenizer, AutoTokenizer, GPT2Tokenizer +from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP +from pytorch_transformers.modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP + + +class AutoTokenizerTest(unittest.TestCase): + def test_tokenizer_from_pretrained(self): + logging.basicConfig(level=logging.INFO) + for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + tokenizer = AutoTokenizer.from_pretrained(model_name) + self.assertIsNotNone(tokenizer) + self.assertIsInstance(tokenizer, BertTokenizer) + self.assertGreater(len(tokenizer), 0) + + for model_name in list(GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]: + tokenizer = AutoTokenizer.from_pretrained(model_name) + self.assertIsNotNone(tokenizer) + self.assertIsInstance(tokenizer, GPT2Tokenizer) + self.assertGreater(len(tokenizer), 0) + + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_bert_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_bert_test.py new file mode 100755 index 0000000000000000000000000000000000000000..1111683ecc5409e5e4a8a3f2e98f9f0b3666766d --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_bert_test.py @@ -0,0 +1,141 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
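+# Tests for the BERT tokenizer: basic and wordpiece tokenization, character-class helpers, and special-token sequence builders.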
+from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +from io import open + +from pytorch_transformers.tokenization_bert import (BasicTokenizer, + BertTokenizer, + WordpieceTokenizer, + _is_control, _is_punctuation, + _is_whitespace, VOCAB_FILES_NAMES) + +from .tokenization_tests_commons import CommonTestCases + +class BertTokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = BertTokenizer + + def setUp(self): + super(BertTokenizationTest, self).setUp() + + vocab_tokens = [ + "[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn", + "##ing", ",", "low", "lowest", + ] + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + with open(self.vocab_file, "w", encoding='utf-8') as vocab_writer: + vocab_writer.write("".join([x + "\n" for x in vocab_tokens])) + + def get_tokenizer(self, **kwargs): + return BertTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"UNwant\u00E9d,running" + output_text = u"unwanted, running" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = self.tokenizer_class(self.vocab_file) + + tokens = tokenizer.tokenize(u"UNwant\u00E9d,running") + self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"]) + self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [7, 4, 5, 10, 8, 9]) + + def test_chinese(self): + tokenizer = BasicTokenizer() + + self.assertListEqual( + tokenizer.tokenize(u"ah\u535A\u63A8zz"), + [u"ah", u"\u535A", u"\u63A8", u"zz"]) + + def test_basic_tokenizer_lower(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(u" \tHeLLo!how \n Are yoU? "), + ["hello", "!", "how", "are", "you", "?"]) + self.assertListEqual(tokenizer.tokenize(u"H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_no_lower(self): + tokenizer = BasicTokenizer(do_lower_case=False) + + self.assertListEqual( + tokenizer.tokenize(u" \tHeLLo!how \n Are yoU? 
"), + ["HeLLo", "!", "how", "Are", "yoU", "?"]) + + def test_wordpiece_tokenizer(self): + vocab_tokens = [ + "[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn", + "##ing" + ] + + vocab = {} + for (i, token) in enumerate(vocab_tokens): + vocab[token] = i + tokenizer = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]") + + self.assertListEqual(tokenizer.tokenize(""), []) + + self.assertListEqual( + tokenizer.tokenize("unwanted running"), + ["un", "##want", "##ed", "runn", "##ing"]) + + self.assertListEqual( + tokenizer.tokenize("unwantedX running"), ["[UNK]", "runn", "##ing"]) + + def test_is_whitespace(self): + self.assertTrue(_is_whitespace(u" ")) + self.assertTrue(_is_whitespace(u"\t")) + self.assertTrue(_is_whitespace(u"\r")) + self.assertTrue(_is_whitespace(u"\n")) + self.assertTrue(_is_whitespace(u"\u00A0")) + + self.assertFalse(_is_whitespace(u"A")) + self.assertFalse(_is_whitespace(u"-")) + + def test_is_control(self): + self.assertTrue(_is_control(u"\u0005")) + + self.assertFalse(_is_control(u"A")) + self.assertFalse(_is_control(u" ")) + self.assertFalse(_is_control(u"\t")) + self.assertFalse(_is_control(u"\r")) + + def test_is_punctuation(self): + self.assertTrue(_is_punctuation(u"-")) + self.assertTrue(_is_punctuation(u"$")) + self.assertTrue(_is_punctuation(u"`")) + self.assertTrue(_is_punctuation(u".")) + + self.assertFalse(_is_punctuation(u"A")) + self.assertFalse(_is_punctuation(u" ")) + + def test_sequence_builders(self): + tokenizer = self.tokenizer_class.from_pretrained("bert-base-uncased") + + text = tokenizer.encode("sequence builders") + text_2 = tokenizer.encode("multi-sequence build") + + encoded_sentence = tokenizer.add_special_tokens_single_sentence(text) + encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2) + + assert encoded_sentence == [101] + text + [102] + assert encoded_pair == [101] + text + [102] + text_2 + [102] + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_dilbert_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_dilbert_test.py new file mode 100755 index 0000000000000000000000000000000000000000..42f80609981406dad2d7c73ecceca3ee660a7511 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_dilbert_test.py @@ -0,0 +1,46 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +from io import open + +from pytorch_transformers.tokenization_distilbert import (DistilBertTokenizer) + +from .tokenization_tests_commons import CommonTestCases +from .tokenization_bert_test import BertTokenizationTest + +class DistilBertTokenizationTest(BertTokenizationTest): + + tokenizer_class = DistilBertTokenizer + + def get_tokenizer(self, **kwargs): + return DistilBertTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def test_sequence_builders(self): + tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased") + + text = tokenizer.encode("sequence builders") + text_2 = tokenizer.encode("multi-sequence build") + + encoded_sentence = tokenizer.add_special_tokens_single_sentence(text) + encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2) + + assert encoded_sentence == [101] + text + [102] + assert encoded_pair == [101] + text + [102] + text_2 + [102] + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_gpt2_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_gpt2_test.py new file mode 100755 index 0000000000000000000000000000000000000000..8ee9cb0b5420c86fef080fee27665b8f12509c5c --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_gpt2_test.py @@ -0,0 +1,72 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +import json +from io import open + +from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer, VOCAB_FILES_NAMES + +from .tokenization_tests_commons import CommonTestCases + +class GPT2TokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = GPT2Tokenizer + + def setUp(self): + super(GPT2TokenizationTest, self).setUp() + + # Adapted from Sennrich et al. 
2015 and https://github.com/rsennrich/subword-nmt + vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", + "\u0120", "\u0120l", "\u0120n", + "\u0120lo", "\u0120low", "er", + "\u0120lowest", "\u0120newer", "\u0120wider", ""] + vocab_tokens = dict(zip(vocab, range(len(vocab)))) + merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""] + self.special_tokens_map = {"unk_token": ""} + + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file']) + with open(self.vocab_file, "w", encoding="utf-8") as fp: + fp.write(json.dumps(vocab_tokens) + "\n") + with open(self.merges_file, "w", encoding="utf-8") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return GPT2Tokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"lower newer" + output_text = u" lower newer" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = GPT2Tokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map) + text = "lower newer" + bpe_tokens = ["\u0120low", "er", "\u0120", "n", "e", "w", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [tokenizer.unk_token] + input_bpe_tokens = [14, 15, 10, 9, 3, 2, 15, 19] + self.assertListEqual( + tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_openai_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_openai_test.py new file mode 100755 index 0000000000000000000000000000000000000000..6b86416d2d6b6123c9c3fd3e109899dc6daa6174 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_openai_test.py @@ -0,0 +1,72 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +import json + +from pytorch_transformers.tokenization_openai import OpenAIGPTTokenizer, VOCAB_FILES_NAMES + +from .tokenization_tests_commons import CommonTestCases + + +class OpenAIGPTTokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = OpenAIGPTTokenizer + + def setUp(self): + super(OpenAIGPTTokenizationTest, self).setUp() + + # Adapted from Sennrich et al. 
2015 and https://github.com/rsennrich/subword-nmt + vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", + "w", "r", "t", + "lo", "low", "er", + "low", "lowest", "newer", "wider", ""] + vocab_tokens = dict(zip(vocab, range(len(vocab)))) + merges = ["#version: 0.2", "l o", "lo w", "e r", ""] + + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file']) + with open(self.vocab_file, "w") as fp: + fp.write(json.dumps(vocab_tokens)) + with open(self.merges_file, "w") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + return OpenAIGPTTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"lower newer" + output_text = u"lower newer" + return input_text, output_text + + + def test_full_tokenizer(self): + tokenizer = OpenAIGPTTokenizer(self.vocab_file, self.merges_file) + + text = "lower" + bpe_tokens = ["low", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [""] + input_bpe_tokens = [14, 15, 20] + self.assertListEqual( + tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_roberta_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_roberta_test.py new file mode 100755 index 0000000000000000000000000000000000000000..8add2529a54d06494b3c856784cddace2fe27c9d --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_roberta_test.py @@ -0,0 +1,98 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import json +import unittest +from io import open + +from pytorch_transformers.tokenization_roberta import RobertaTokenizer, VOCAB_FILES_NAMES +from .tokenization_tests_commons import CommonTestCases + + +class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester): + tokenizer_class = RobertaTokenizer + + def setUp(self): + super(RobertaTokenizationTest, self).setUp() + + # Adapted from Sennrich et al. 
2015 and https://github.com/rsennrich/subword-nmt + vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", + "\u0120", "\u0120l", "\u0120n", + "\u0120lo", "\u0120low", "er", + "\u0120lowest", "\u0120newer", "\u0120wider", ""] + vocab_tokens = dict(zip(vocab, range(len(vocab)))) + merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""] + self.special_tokens_map = {"unk_token": ""} + + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file']) + with open(self.vocab_file, "w", encoding="utf-8") as fp: + fp.write(json.dumps(vocab_tokens) + "\n") + with open(self.merges_file, "w", encoding="utf-8") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return RobertaTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"lower newer" + output_text = u" lower newer" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = RobertaTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map) + text = "lower newer" + bpe_tokens = ["\u0120low", "er", "\u0120", "n", "e", "w", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [tokenizer.unk_token] + input_bpe_tokens = [14, 15, 10, 9, 3, 2, 15, 19] + self.assertListEqual( + tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + def roberta_dict_integration_testing(self): + tokenizer = self.get_tokenizer() + + self.assertListEqual( + tokenizer.encode('Hello world!'), + [0, 31414, 232, 328, 2] + ) + self.assertListEqual( + tokenizer.encode('Hello world! cécé herlolip 418'), + [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2] + ) + + def test_sequence_builders(self): + tokenizer = RobertaTokenizer.from_pretrained("roberta-base") + + text = tokenizer.encode("sequence builders") + text_2 = tokenizer.encode("multi-sequence build") + + encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True) + encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True) + + encoded_sentence = tokenizer.add_special_tokens_single_sentence(text) + encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2) + + assert encoded_sentence == encoded_text_from_decode + assert encoded_pair == encoded_pair_from_decode + + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_tests_commons.py b/Optimus/code/pytorch_transformers/tests/tokenization_tests_commons.py new file mode 100755 index 0000000000000000000000000000000000000000..3da0494ac44fbc062f6896a89cd4942056d14444 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_tests_commons.py @@ -0,0 +1,188 @@ +# coding=utf-8 +# Copyright 2019 HuggingFace Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
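# The GPT-2 and RoBERTa tests above share a toy byte-level BPE fixture. The
# self-contained sketch below is an abridged copy of that fixture (the unk
# entry and special-token map are left out) and shows how the ranked merges
# drive tokenization: "\u0120" is the byte-level encoding of a leading space,
# so " lower" is folded into "\u0120low" + "er", while " newer" falls back to
# single characters plus "er" because only the "e r" merge applies. Note that
# "\u0120newer" sits in the toy vocab, but BPE consults the merge table, not
# the vocabulary, when splitting.
import json
import os
import tempfile

from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer, VOCAB_FILES_NAMES

vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
         "\u0120", "\u0120l", "\u0120n", "\u0120lo", "\u0120low", "er",
         "\u0120lowest", "\u0120newer", "\u0120wider"]
merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""]

tmpdirname = tempfile.mkdtemp()
vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
merges_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['merges_file'])
with open(vocab_file, "w", encoding="utf-8") as fp:
    json.dump(dict(zip(vocab, range(len(vocab)))), fp)   # token -> id
with open(merges_file, "w", encoding="utf-8") as fp:
    fp.write("\n".join(merges))                          # ranked merge rules

tokenizer = GPT2Tokenizer(vocab_file, merges_file)
print(tokenizer.tokenize("lower newer"))
# ['\u0120low', 'er', '\u0120', 'n', 'e', 'w', 'er'], the bpe_tokens asserted above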
+from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import sys +from io import open +import tempfile +import shutil +import unittest + +if sys.version_info[0] == 2: + import cPickle as pickle + + class TemporaryDirectory(object): + """Context manager for tempfile.mkdtemp() so it's usable with "with" statement.""" + def __enter__(self): + self.name = tempfile.mkdtemp() + return self.name + def __exit__(self, exc_type, exc_value, traceback): + shutil.rmtree(self.name) +else: + import pickle + TemporaryDirectory = tempfile.TemporaryDirectory + unicode = str + + +class CommonTestCases: + + class CommonTokenizerTester(unittest.TestCase): + + tokenizer_class = None + + def setUp(self): + self.tmpdirname = tempfile.mkdtemp() + + def tearDown(self): + shutil.rmtree(self.tmpdirname) + + def get_tokenizer(self, **kwargs): + raise NotImplementedError + + def get_input_output_texts(self): + raise NotImplementedError + + def test_tokenizers_common_properties(self): + tokenizer = self.get_tokenizer() + attributes_list = ["bos_token", "eos_token", "unk_token", "sep_token", + "pad_token", "cls_token", "mask_token"] + for attr in attributes_list: + self.assertTrue(hasattr(tokenizer, attr)) + self.assertTrue(hasattr(tokenizer, attr + "_id")) + + self.assertTrue(hasattr(tokenizer, "additional_special_tokens")) + self.assertTrue(hasattr(tokenizer, 'additional_special_tokens_ids')) + + attributes_list = ["max_len", "init_inputs", "init_kwargs", "added_tokens_encoder", + "added_tokens_decoder"] + for attr in attributes_list: + self.assertTrue(hasattr(tokenizer, attr)) + + def test_save_and_load_tokenizer(self): + # safety check on max_len default value so we are sure the test works + tokenizer = self.get_tokenizer() + self.assertNotEqual(tokenizer.max_len, 42) + + # Now let's start the test + tokenizer = self.get_tokenizer(max_len=42) + + before_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running") + + with TemporaryDirectory() as tmpdirname: + tokenizer.save_pretrained(tmpdirname) + tokenizer = self.tokenizer_class.from_pretrained(tmpdirname) + + after_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running") + self.assertListEqual(before_tokens, after_tokens) + + self.assertEqual(tokenizer.max_len, 42) + tokenizer = self.tokenizer_class.from_pretrained(tmpdirname, max_len=43) + self.assertEqual(tokenizer.max_len, 43) + + def test_pickle_tokenizer(self): + tokenizer = self.get_tokenizer() + self.assertIsNotNone(tokenizer) + + text = u"Munich and Berlin are nice cities" + subwords = tokenizer.tokenize(text) + + with TemporaryDirectory() as tmpdirname: + + filename = os.path.join(tmpdirname, u"tokenizer.bin") + pickle.dump(tokenizer, open(filename, "wb")) + + tokenizer_new = pickle.load(open(filename, "rb")) + + subwords_loaded = tokenizer_new.tokenize(text) + + self.assertListEqual(subwords, subwords_loaded) + + + def test_add_tokens_tokenizer(self): + tokenizer = self.get_tokenizer() + + vocab_size = tokenizer.vocab_size + all_size = len(tokenizer) + + self.assertNotEqual(vocab_size, 0) + self.assertEqual(vocab_size, all_size) + + new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"] + added_toks = tokenizer.add_tokens(new_toks) + vocab_size_2 = tokenizer.vocab_size + all_size_2 = len(tokenizer) + + self.assertNotEqual(vocab_size_2, 0) + self.assertEqual(vocab_size, vocab_size_2) + self.assertEqual(added_toks, len(new_toks)) + self.assertEqual(all_size_2, all_size + len(new_toks)) + + tokens = tokenizer.encode("aaaaa bbbbbb low 
cccccccccdddddddd l") + out_string = tokenizer.decode(tokens) + + self.assertGreaterEqual(len(tokens), 4) + self.assertGreater(tokens[0], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + + new_toks_2 = {'eos_token': ">>>>|||<||<<|<<", + 'pad_token': "<<<<<|||>|>>>>|>"} + added_toks_2 = tokenizer.add_special_tokens(new_toks_2) + vocab_size_3 = tokenizer.vocab_size + all_size_3 = len(tokenizer) + + self.assertNotEqual(vocab_size_3, 0) + self.assertEqual(vocab_size, vocab_size_3) + self.assertEqual(added_toks_2, len(new_toks_2)) + self.assertEqual(all_size_3, all_size_2 + len(new_toks_2)) + + tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l") + out_string = tokenizer.decode(tokens) + + self.assertGreaterEqual(len(tokens), 6) + self.assertGreater(tokens[0], tokenizer.vocab_size - 1) + self.assertGreater(tokens[0], tokens[1]) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokens[-3]) + self.assertEqual(tokens[0], tokenizer.eos_token_id) + self.assertEqual(tokens[-2], tokenizer.pad_token_id) + + + def test_required_methods_tokenizer(self): + tokenizer = self.get_tokenizer() + input_text, output_text = self.get_input_output_texts() + + tokens = tokenizer.tokenize(input_text) + ids = tokenizer.convert_tokens_to_ids(tokens) + ids_2 = tokenizer.encode(input_text) + self.assertListEqual(ids, ids_2) + + tokens_2 = tokenizer.convert_ids_to_tokens(ids) + text_2 = tokenizer.decode(ids) + + self.assertEqual(text_2, output_text) + + self.assertNotEqual(len(tokens_2), 0) + self.assertIsInstance(text_2, (str, unicode)) + + + def test_pretrained_model_lists(self): + weights_list = list(self.tokenizer_class.max_model_input_sizes.keys()) + weights_lists_2 = [] + for file_id, map_list in self.tokenizer_class.pretrained_vocab_files_map.items(): + weights_lists_2.append(list(map_list.keys())) + + for weights_list_2 in weights_lists_2: + self.assertListEqual(weights_list, weights_list_2) diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_transfo_xl_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_transfo_xl_test.py new file mode 100755 index 0000000000000000000000000000000000000000..f881cf1d2b4279da69d6f1ef61aa9841e3b347ad --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_transfo_xl_test.py @@ -0,0 +1,74 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
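# The common tester above round-trips every tokenizer through save_pretrained /
# from_pretrained (test_save_and_load_tokenizer). A minimal sketch of the same
# pattern against one concrete class; it assumes network access to fetch the
# 'bert-base-uncased' vocabulary and Python 3 for tempfile.TemporaryDirectory
# (the tests above ship their own Python 2 shim).
import tempfile

from pytorch_transformers.tokenization_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = u"He is very happy, UNwant\u00E9d,running"
before_ids = tokenizer.encode(sentence)

with tempfile.TemporaryDirectory() as tmpdirname:
    tokenizer.save_pretrained(tmpdirname)               # writes the vocab and special-token maps
    reloaded = BertTokenizer.from_pretrained(tmpdirname)
    after_ids = reloaded.encode(sentence)

assert before_ids == after_ids                          # encoding survives the round trip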
+from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +from io import open + +from pytorch_transformers.tokenization_transfo_xl import TransfoXLTokenizer, VOCAB_FILES_NAMES + +from.tokenization_tests_commons import CommonTestCases + +class TransfoXLTokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = TransfoXLTokenizer + + def setUp(self): + super(TransfoXLTokenizationTest, self).setUp() + + vocab_tokens = [ + "", "[CLS]", "[SEP]", "want", "unwanted", "wa", "un", + "running", ",", "low", "l", + ] + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + with open(self.vocab_file, "w", encoding='utf-8') as vocab_writer: + vocab_writer.write("".join([x + "\n" for x in vocab_tokens])) + + def get_tokenizer(self, **kwargs): + kwargs['lower_case'] = True + return TransfoXLTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u" UNwanted , running" + output_text = u" unwanted, running" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = TransfoXLTokenizer(vocab_file=self.vocab_file, lower_case=True) + + tokens = tokenizer.tokenize(u" UNwanted , running") + self.assertListEqual(tokens, ["", "unwanted", ",", "running"]) + + self.assertListEqual( + tokenizer.convert_tokens_to_ids(tokens), [0, 4, 8, 7]) + + def test_full_tokenizer_lower(self): + tokenizer = TransfoXLTokenizer(lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(u" \tHeLLo ! how \n Are yoU ? "), + ["hello", "!", "how", "are", "you", "?"]) + + def test_full_tokenizer_no_lower(self): + tokenizer = TransfoXLTokenizer(lower_case=False) + + self.assertListEqual( + tokenizer.tokenize(u" \tHeLLo ! how \n Are yoU ? "), + ["HeLLo", "!", "how", "Are", "yoU", "?"]) + + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_utils_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_utils_test.py new file mode 100755 index 0000000000000000000000000000000000000000..26ec2d7a3946e660d93cbfe7194c69ee5d2f069c --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_utils_test.py @@ -0,0 +1,46 @@ +# coding=utf-8 +# Copyright 2018 HuggingFace Inc.. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
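# TransfoXLTokenizer, exercised just above, is a word-level (whitespace-delimited)
# tokenizer, so it can be driven without any vocabulary file; lower_case only
# changes casing, not the split points. A quick sketch restating the two
# assertions from the tests above:
from pytorch_transformers.tokenization_transfo_xl import TransfoXLTokenizer

text = u" \tHeLLo ! how \n Are yoU ? "
print(TransfoXLTokenizer(lower_case=True).tokenize(text))
# ['hello', '!', 'how', 'are', 'you', '?']
print(TransfoXLTokenizer(lower_case=False).tokenize(text))
# ['HeLLo', '!', 'how', 'Are', 'yoU', '?']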
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import six + +from pytorch_transformers import PreTrainedTokenizer +from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer + +class TokenizerUtilsTest(unittest.TestCase): + def check_tokenizer_from_pretrained(self, tokenizer_class): + s3_models = list(tokenizer_class.max_model_input_sizes.keys()) + for model_name in s3_models[:1]: + tokenizer = tokenizer_class.from_pretrained(model_name) + self.assertIsNotNone(tokenizer) + self.assertIsInstance(tokenizer, tokenizer_class) + self.assertIsInstance(tokenizer, PreTrainedTokenizer) + + for special_tok in tokenizer.all_special_tokens: + if six.PY2: + self.assertIsInstance(special_tok, unicode) + else: + self.assertIsInstance(special_tok, str) + special_tok_id = tokenizer.convert_tokens_to_ids(special_tok) + self.assertIsInstance(special_tok_id, int) + + def test_pretrained_tokenizers(self): + self.check_tokenizer_from_pretrained(GPT2Tokenizer) + +if __name__ == "__main__": + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_xlm_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_xlm_test.py new file mode 100755 index 0000000000000000000000000000000000000000..43f1e0c5dd7396d592159ca121e04ee29ff37cc0 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_xlm_test.py @@ -0,0 +1,82 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest +import json + +from pytorch_transformers.tokenization_xlm import XLMTokenizer, VOCAB_FILES_NAMES + +from .tokenization_tests_commons import CommonTestCases + +class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = XLMTokenizer + + def setUp(self): + super(XLMTokenizationTest, self).setUp() + + # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt + vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", + "w", "r", "t", + "lo", "low", "er", + "low", "lowest", "newer", "wider", ""] + vocab_tokens = dict(zip(vocab, range(len(vocab)))) + merges = ["l o 123", "lo w 1456", "e r 1789", ""] + + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file']) + self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file']) + with open(self.vocab_file, "w") as fp: + fp.write(json.dumps(vocab_tokens)) + with open(self.merges_file, "w") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + return XLMTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"lower newer" + output_text = u"lower newer" + return input_text, output_text + + def test_full_tokenizer(self): + """ Adapted from Sennrich et al. 
2015 and https://github.com/rsennrich/subword-nmt """ + tokenizer = XLMTokenizer(self.vocab_file, self.merges_file) + + text = "lower" + bpe_tokens = ["low", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [""] + input_bpe_tokens = [14, 15, 20] + self.assertListEqual( + tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + def test_sequence_builders(self): + tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-en-2048") + + text = tokenizer.encode("sequence builders") + text_2 = tokenizer.encode("multi-sequence build") + + encoded_sentence = tokenizer.add_special_tokens_single_sentence(text) + encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2) + + assert encoded_sentence == [1] + text + [1] + assert encoded_pair == [1] + text + [1] + text_2 + [1] + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tests/tokenization_xlnet_test.py b/Optimus/code/pytorch_transformers/tests/tokenization_xlnet_test.py new file mode 100755 index 0000000000000000000000000000000000000000..c603ce55f9d7f94ff41b89b5968ef4dab7fba196 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tests/tokenization_xlnet_test.py @@ -0,0 +1,106 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
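# The *_sequence_builders tests in these files pin down how each tokenizer wraps
# a single segment in special tokens. A condensed sketch of the layouts asserted
# in the DistilBERT, XLM and XLNet test files; each from_pretrained call needs
# network access to download the corresponding vocabulary.
from pytorch_transformers.tokenization_distilbert import DistilBertTokenizer
from pytorch_transformers.tokenization_xlm import XLMTokenizer
from pytorch_transformers.tokenization_xlnet import XLNetTokenizer

distil = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
xlm = XLMTokenizer.from_pretrained("xlm-mlm-en-2048")
xlnet = XLNetTokenizer.from_pretrained("xlnet-base-cased")

ids = distil.encode("sequence builders")
assert distil.add_special_tokens_single_sentence(ids) == [101] + ids + [102]  # [CLS] X [SEP]
ids = xlm.encode("sequence builders")
assert xlm.add_special_tokens_single_sentence(ids) == [1] + ids + [1]         # same id on both ends
ids = xlnet.encode("sequence builders")
assert xlnet.add_special_tokens_single_sentence(ids) == ids + [4, 3]          # sep and cls come last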
+from __future__ import absolute_import, division, print_function, unicode_literals + +import os +import unittest + +from pytorch_transformers.tokenization_xlnet import (XLNetTokenizer, SPIECE_UNDERLINE) + +from .tokenization_tests_commons import CommonTestCases + +SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), + 'fixtures/test_sentencepiece.model') + +class XLNetTokenizationTest(CommonTestCases.CommonTokenizerTester): + + tokenizer_class = XLNetTokenizer + + def setUp(self): + super(XLNetTokenizationTest, self).setUp() + + # We have a SentencePiece fixture for testing + tokenizer = XLNetTokenizer(SAMPLE_VOCAB, keep_accents=True) + tokenizer.save_pretrained(self.tmpdirname) + + def get_tokenizer(self, **kwargs): + return XLNetTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self): + input_text = u"This is a test" + output_text = u"This is a test" + return input_text, output_text + + + def test_full_tokenizer(self): + tokenizer = XLNetTokenizer(SAMPLE_VOCAB, keep_accents=True) + + tokens = tokenizer.tokenize(u'This is a test') + self.assertListEqual(tokens, [u'▁This', u'▁is', u'▁a', u'▁t', u'est']) + + self.assertListEqual( + tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382]) + + tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.") + self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b', + u'or', u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'', + u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this', + SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's', u'é', u'.']) + ids = tokenizer.convert_tokens_to_ids(tokens) + self.assertListEqual( + ids, [8, 21, 84, 55, 24, 19, 7, 0, + 602, 347, 347, 347, 3, 12, 66, + 46, 72, 80, 6, 0, 4]) + + back_tokens = tokenizer.convert_ids_to_tokens(ids) + self.assertListEqual(back_tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b', + u'or', u'n', SPIECE_UNDERLINE + u'in', + SPIECE_UNDERLINE + u'', u'', u'2', u'0', u'0', u'0', u',', + SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this', + SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's', + u'', u'.']) + + def test_tokenizer_lower(self): + tokenizer = XLNetTokenizer(SAMPLE_VOCAB, do_lower_case=True) + tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.") + self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'', u'i', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b', + u'or', u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'', + u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this', + SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u'se', u'.']) + self.assertListEqual(tokenizer.tokenize(u"H\u00E9llo"), [u"▁he", u"ll", u"o"]) + + def test_tokenizer_no_lower(self): + tokenizer = XLNetTokenizer(SAMPLE_VOCAB, do_lower_case=False) + tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.") + self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b', u'or', + u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'', + u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this', + SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u'se', u'.']) + + def test_sequence_builders(self): + tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased") + + text = tokenizer.encode("sequence builders") + text_2 = 
tokenizer.encode("multi-sequence build") + + encoded_sentence = tokenizer.add_special_tokens_single_sentence(text) + encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2) + + assert encoded_sentence == text + [4, 3] + assert encoded_pair == text + [4] + text_2 + [4, 3] + + +if __name__ == '__main__': + unittest.main() diff --git a/Optimus/code/pytorch_transformers/tokenization_auto.py b/Optimus/code/pytorch_transformers/tokenization_auto.py new file mode 100755 index 0000000000000000000000000000000000000000..889774b36c9255e24507a84515c3e8e97dc6b574 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_auto.py @@ -0,0 +1,120 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Auto Model class. """ + +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging + +from .tokenization_bert import BertTokenizer +from .tokenization_openai import OpenAIGPTTokenizer +from .tokenization_gpt2 import GPT2Tokenizer +from .tokenization_transfo_xl import TransfoXLTokenizer +from .tokenization_xlnet import XLNetTokenizer +from .tokenization_xlm import XLMTokenizer +from .tokenization_roberta import RobertaTokenizer +from .tokenization_distilbert import DistilBertTokenizer + +logger = logging.getLogger(__name__) + +class AutoTokenizer(object): + r""":class:`~pytorch_transformers.AutoTokenizer` is a generic tokenizer class + that will be instantiated as one of the tokenizer classes of the library + when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` + class method. + + The `from_pretrained()` method take care of returning the correct tokenizer class instance + using pattern matching on the `pretrained_model_name_or_path` string. + + The tokenizer class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertTokenizer (DistilBert model) + - contains `roberta`: RobertaTokenizer (RoBERTa model) + - contains `bert`: BertTokenizer (Bert model) + - contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model) + - contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model) + - contains `xlnet`: XLNetTokenizer (XLNet model) + - contains `xlm`: XLMTokenizer (XLM model) + + This class cannot be instantiated using `__init__()` (throw an error). + """ + def __init__(self): + raise EnvironmentError("AutoTokenizer is designed to be instantiated " + "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.") + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): + r""" Instantiate a one of the tokenizer classes of the library + from a pre-trained model vocabulary. 
+ + The tokenizer class to instantiate is selected as the first pattern matching + in the `pretrained_model_name_or_path` string (in the following order): + - contains `distilbert`: DistilBertTokenizer (DistilBert model) + - contains `roberta`: RobertaTokenizer (XLM model) + - contains `bert`: BertTokenizer (Bert model) + - contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model) + - contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model) + - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model) + - contains `xlnet`: XLNetTokenizer (XLNet model) + - contains `xlm`: XLMTokenizer (XLM model) + + Params: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``. + - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the vocabulary files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method. + + kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details. + + Examples:: + + tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # Download vocabulary from S3 and cache. + tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/') # E.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')` + + """ + if 'distilbert' in pretrained_model_name_or_path: + return DistilBertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'roberta' in pretrained_model_name_or_path: + return RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'bert' in pretrained_model_name_or_path: + return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'openai-gpt' in pretrained_model_name_or_path: + return OpenAIGPTTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'gpt2' in pretrained_model_name_or_path: + return GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'transfo-xl' in pretrained_model_name_or_path: + return TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'xlnet' in pretrained_model_name_or_path: + return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + elif 'xlm' in pretrained_model_name_or_path: + return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + + raise ValueError("Unrecognized model identifier in {}. Should contains one of " + "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', " + "'xlm', 'roberta'".format(pretrained_model_name_or_path)) diff --git a/Optimus/code/pytorch_transformers/tokenization_bert.py b/Optimus/code/pytorch_transformers/tokenization_bert.py new file mode 100755 index 0000000000000000000000000000000000000000..b85a4ccf9c382f49e2ba6c68a4e76a8c8d99ef19 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_bert.py @@ -0,0 +1,457 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
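# The AutoTokenizer dispatch above is plain substring matching on the model name
# or path, with the more specific patterns checked first, so 'distilbert-base-uncased'
# is routed to DistilBertTokenizer even though the string also contains 'bert'.
# A short sketch (the vocabularies are downloaded on first use):
from pytorch_transformers.tokenization_auto import AutoTokenizer
from pytorch_transformers.tokenization_bert import BertTokenizer
from pytorch_transformers.tokenization_distilbert import DistilBertTokenizer

assert isinstance(AutoTokenizer.from_pretrained("distilbert-base-uncased"), DistilBertTokenizer)
assert isinstance(AutoTokenizer.from_pretrained("bert-base-cased"), BertTokenizer)

# A saved directory resolves the same way, provided its path contains one of the
# patterns, e.g. a hypothetical './output/bert_finetuned/' would select BertTokenizer.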
+"""Tokenization classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import logging +import os +import unicodedata +from io import open + +from .tokenization_utils import PreTrainedTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt", + 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt", + 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt", + 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt", + 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt", + 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt", + 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt", + 'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt", + 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt", + 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt", + 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt", + 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'bert-base-uncased': 512, + 'bert-large-uncased': 512, + 'bert-base-cased': 512, + 'bert-large-cased': 512, + 'bert-base-multilingual-uncased': 512, + 'bert-base-multilingual-cased': 512, + 'bert-base-chinese': 512, + 'bert-base-german-cased': 512, + 'bert-large-uncased-whole-word-masking': 512, + 'bert-large-cased-whole-word-masking': 512, + 'bert-large-uncased-whole-word-masking-finetuned-squad': 512, + 'bert-large-cased-whole-word-masking-finetuned-squad': 512, + 'bert-base-cased-finetuned-mrpc': 512, +} + +PRETRAINED_INIT_CONFIGURATION = { + 'bert-base-uncased': {'do_lower_case': True}, + 'bert-large-uncased': {'do_lower_case': True}, + 'bert-base-cased': {'do_lower_case': False}, + 'bert-large-cased': {'do_lower_case': False}, + 'bert-base-multilingual-uncased': {'do_lower_case': True}, + 'bert-base-multilingual-cased': {'do_lower_case': False}, + 'bert-base-chinese': {'do_lower_case': False}, + 'bert-base-german-cased': {'do_lower_case': False}, + 'bert-large-uncased-whole-word-masking': {'do_lower_case': True}, + 'bert-large-cased-whole-word-masking': {'do_lower_case': False}, + 'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True}, + 'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False}, + 'bert-base-cased-finetuned-mrpc': {'do_lower_case': False}, +} + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a 
dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip('\n') + vocab[token] = index + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(PreTrainedTokenizer): + r""" + Constructs a BertTokenizer. + :class:`~pytorch_transformers.BertTokenizer` runs end-to-end tokenization: punctuation splitting + wordpiece + + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input. Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; Effective maximum length is always the + minimum of this value (if specified) and the underlying BERT model's sequence length. + never_split: List of tokens which will never be split during tokenization. Only has an effect when + do_wordpiece_only=False + """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__(self, vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, + unk_token="[UNK]", sep_token="[SEP]", pad_token="[PAD]", cls_token="[CLS]", + mask_token="[MASK]", tokenize_chinese_chars=True, **kwargs): + """Constructs a BertTokenizer. + + Args: + **vocab_file**: Path to a one-wordpiece-per-line vocabulary file + **do_lower_case**: (`optional`) boolean (default True) + Whether to lower case the input + Only has an effect when do_basic_tokenize=True + **do_basic_tokenize**: (`optional`) boolean (default True) + Whether to do basic tokenization before wordpiece. + **never_split**: (`optional`) list of string + List of tokens which will never be split during tokenization. + Only has an effect when do_basic_tokenize=True + **tokenize_chinese_chars**: (`optional`) boolean (default True) + Whether to tokenize Chinese characters. + This should likely be deactivated for Japanese: + see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328 + """ + super(BertTokenizer, self).__init__(unk_token=unk_token, sep_token=sep_token, + pad_token=pad_token, cls_token=cls_token, + mask_token=mask_token, **kwargs) + self.max_len_single_sentence = self.max_len - 2 # take into account special tokens + self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens + + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict( + [(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, + never_split=never_split, + tokenize_chinese_chars=tokenize_chinese_chars) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + + @property + def vocab_size(self): + return len(self.vocab) + + def _tokenize(self, text): + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str/unicode) in an id using the vocab. """ + return self.vocab.get(token, self.vocab.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (string/unicode) using the vocab.""" + return self.ids_to_tokens.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = ' '.join(tokens).replace(' ##', '').strip() + return out_string + + def add_special_tokens_single_sentence(self, token_ids): + """ + Adds special tokens to the a sequence for sequence classification tasks. + A BERT sequence has the following format: [CLS] X [SEP] + """ + return [self.cls_token_id] + token_ids + [self.sep_token_id] + + def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1): + """ + Adds special tokens to a sequence pair for sequence classification tasks. + A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP] + """ + sep = [self.sep_token_id] + cls = [self.cls_token_id] + return cls + token_ids_0 + sep + token_ids_1 + sep + + def save_vocabulary(self, vocab_path): + """Save the tokenizer vocabulary to a directory or file.""" + index = 0 + if os.path.isdir(vocab_path): + vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES['vocab_file']) + else: + vocab_file = vocab_path + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!".format(vocab_file)) + index = token_index + writer.write(token + u'\n') + index += 1 + return (vocab_file,) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True): + """ Constructs a BasicTokenizer. + + Args: + **do_lower_case**: Whether to lower case the input. + **never_split**: (`optional`) list of str + Kept for backward compatibility purposes. + Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`) + List of token not to split. + **tokenize_chinese_chars**: (`optional`) boolean (default True) + Whether to tokenize Chinese characters. 
+ This should likely be deactivated for Japanese: + see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328 + """ + if never_split is None: + never_split = [] + self.do_lower_case = do_lower_case + self.never_split = never_split + self.tokenize_chinese_chars = tokenize_chinese_chars + + def tokenize(self, text, never_split=None): + """ Basic Tokenization of a piece of text. + Split on "white spaces" only, for sub-word tokenization, see WordPieceTokenizer. + + Args: + **never_split**: (`optional`) list of str + Kept for backward compatibility purposes. + Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`) + List of token not to split. + """ + never_split = self.never_split + (never_split if never_split is not None else []) + text = self._clean_text(text) + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + if self.tokenize_chinese_chars: + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case and token not in never_split: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text, never_split=None): + """Splits punctuation on a piece of text.""" + if never_split is not None and text in never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token, max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer`. + + Returns: + A list of wordpiece tokens. + """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. 
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/Optimus/code/pytorch_transformers/tokenization_distilbert.py b/Optimus/code/pytorch_transformers/tokenization_distilbert.py new file mode 100755 index 0000000000000000000000000000000000000000..5a6d02f98df7f22e7a2e590aac7ea35f0c7d0862 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_distilbert.py @@ -0,0 +1,62 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for DistilBERT.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import logging +import os +import unicodedata +from io import open + +from .tokenization_bert import BertTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt", + 'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'distilbert-base-uncased': 512, + 'distilbert-base-uncased-distilled-squad': 512, +} + + +class DistilBertTokenizer(BertTokenizer): + r""" + Constructs a DistilBertTokenizer. + :class:`~pytorch_transformers.DistilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece + + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input. Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; Effective maximum length is always the + minimum of this value (if specified) and the underlying BERT model's sequence length. + never_split: List of tokens which will never be split during tokenization. Only has an effect when + do_wordpiece_only=False + """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES diff --git a/Optimus/code/pytorch_transformers/tokenization_gpt2.py b/Optimus/code/pytorch_transformers/tokenization_gpt2.py new file mode 100755 index 0000000000000000000000000000000000000000..4ebe1ad57511a02fc3e590e16ab5fd3979eb43c5 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_gpt2.py @@ -0,0 +1,224 @@ +# coding=utf-8 +# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for OpenAI GPT.""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import sys +import json +import logging +import os +import regex as re +from io import open + +try: + from functools import lru_cache +except ImportError: + # Just a dummy decorator to get the checks to run on python2 + # because honestly I don't want to support a byte-level unicode BPE tokenizer on python 2 right now. + def lru_cache(): + return lambda func: func + +from .tokenization_utils import PreTrainedTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = { + 'vocab_file': 'vocab.json', + 'merges_file': 'merges.txt', +} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json", + 'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json", + 'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json", + }, + 'merges_file': + { + 'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt", + 'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt", + 'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'gpt2': 1024, + 'gpt2-medium': 1024, + 'gpt2-large': 1024, +} + +@lru_cache() +def bytes_to_unicode(): + """ + Returns list of utf-8 byte and a mapping to unicode strings. + We specifically avoids mapping to whitespace/control characters the bpe code barfs on. + + The reversible bpe codes work on unicode strings. + This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. + When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. + This is a signficant percentage of your normal, say, 32K bpe vocab. + To avoid that, we want lookup tables between utf-8 bytes and unicode strings. + """ + _chr = unichr if sys.version_info[0] == 2 else chr + bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1)) + cs = bs[:] + n = 0 + for b in range(2**8): + if b not in bs: + bs.append(b) + cs.append(2**8+n) + n += 1 + cs = [_chr(n) for n in cs] + return dict(zip(bs, cs)) + +def get_pairs(word): + """Return set of symbol pairs in a word. + + Word is represented as tuple of symbols (symbols being variable-length strings). + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + +class GPT2Tokenizer(PreTrainedTokenizer): + """ + GPT-2 BPE tokenizer. Peculiarities: + - Byte-level Byte-Pair-Encoding + - Requires a space to start the input string => will add a space is there isn't. 
+ As a consequence, this tokenizer `encode` and `decode` method will not conserve + the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello" + """ + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>", + bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs): + super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs) + self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens + self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens + + self.encoder = json.load(open(vocab_file, encoding="utf-8")) + self.decoder = {v: k for k, v in self.encoder.items()} + self.errors = errors # how to handle errors in decoding + self.byte_encoder = bytes_to_unicode() + self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} + bpe_data = open(merges_file, encoding='utf-8').read().split('\n')[1:-1] + bpe_merges = [tuple(merge.split()) for merge in bpe_data] + self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) + self.cache = {} + + # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions + self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") + + @property + def vocab_size(self): + return len(self.encoder) + + def bpe(self, token): + if token in self.cache: + return self.cache[token] + word = tuple(token) + pairs = get_pairs(word) + + if not pairs: + return token + + while True: + bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf'))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + new_word.extend(word[i:j]) + i = j + except: + new_word.extend(word[i:]) + break + + if word[i] == first and i < len(word)-1 and word[i+1] == second: + new_word.append(first+second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = ' '.join(word) + self.cache[token] = word + return word + + def _tokenize(self, text): + """ Tokenize a string. """ + text = ' ' + text # GPT-2 (and RoBERTa) tokenizers need at least one space to begin the sentence with. + bpe_tokens = [] + for token in re.findall(self.pat, text): + if sys.version_info[0] == 2: + token = ''.join(self.byte_encoder[ord(b)] for b in token) # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case) + else: + token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8')) # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case) + bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' ')) + return bpe_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str/unicode) in an id using the vocab. 
""" + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (string/unicode) using the vocab.""" + return self.decoder.get(index) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + text = ''.join(tokens) + text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors) + return text + + def save_vocabulary(self, save_directory): + """Save the tokenizer vocabulary and merge files to a directory.""" + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file']) + merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file']) + + with open(vocab_file, 'w', encoding='utf-8') as f: + f.write(json.dumps(self.encoder, ensure_ascii=False)) + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + writer.write(u'#version: 0.2\n') + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!".format(merge_file)) + index = token_index + writer.write(' '.join(bpe_tokens) + u'\n') + index += 1 + + return vocab_file, merge_file \ No newline at end of file diff --git a/Optimus/code/pytorch_transformers/tokenization_openai.py b/Optimus/code/pytorch_transformers/tokenization_openai.py new file mode 100755 index 0000000000000000000000000000000000000000..0efbdb37c0c9b087336f57a44d2a2a111078e694 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_openai.py @@ -0,0 +1,208 @@ +# coding=utf-8 +# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for OpenAI GPT.""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import json +import logging +import os +import re +from io import open + +from .tokenization_utils import PreTrainedTokenizer +from .tokenization_bert import BasicTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = { + 'vocab_file': 'vocab.json', + 'merges_file': 'merges.txt', +} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'openai-gpt': "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json", + }, + 'merges_file': + { + 'openai-gpt': "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'openai-gpt': 512, +} + +def get_pairs(word): + """ + Return set of symbol pairs in a word. 
+    word is represented as tuple of symbols (symbols being variable-length strings)
+    """
+    pairs = set()
+    prev_char = word[0]
+    for char in word[1:]:
+        pairs.add((prev_char, char))
+        prev_char = char
+    return pairs
+
+def text_standardize(text):
+    """
+    fixes some issues the spacy tokenizer had on books corpus
+    also does some whitespace standardization
+    """
+    text = text.replace('—', '-')
+    text = text.replace('–', '-')
+    text = text.replace('―', '-')
+    text = text.replace('…', '...')
+    text = text.replace('´', "'")
+    text = re.sub(r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''', r' \1 ', text)
+    text = re.sub(r'\s*\n\s*', ' \n ', text)
+    text = re.sub(r'[^\S\n]+', ' ', text)
+    return text.strip()
+
+class OpenAIGPTTokenizer(PreTrainedTokenizer):
+    """
+    BPE tokenizer. Peculiarities:
+        - lower-cases all inputs
+        - uses the SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, falls back to BERT's BasicTokenizer if not.
+    """
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+
+    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
+        super(OpenAIGPTTokenizer, self).__init__(unk_token=unk_token, **kwargs)
+
+        self.max_len_single_sentence = self.max_len  # no default special tokens - you can update this value if you add special tokens
+        self.max_len_sentences_pair = self.max_len  # no default special tokens - you can update this value if you add special tokens
+
+        try:
+            import ftfy
+            from spacy.lang.en import English
+            _nlp = English()
+            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)
+            self.fix_text = ftfy.fix_text
+        except ImportError:
+            logger.warning("ftfy or spacy is not installed; using BERT BasicTokenizer instead of SpaCy & ftfy.")
+            self.nlp = BasicTokenizer(do_lower_case=True)
+            self.fix_text = None
+
+        self.encoder = json.load(open(vocab_file, encoding="utf-8"))
+        self.decoder = {v:k for k,v in self.encoder.items()}
+        merges = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
+        merges = [tuple(merge.split()) for merge in merges]
+        self.bpe_ranks = dict(zip(merges, range(len(merges))))
+        self.cache = {}
+
+    @property
+    def vocab_size(self):
+        return len(self.encoder)
+
+    def bpe(self, token):
+        word = tuple(token[:-1]) + (token[-1] + '</w>',)
+        if token in self.cache:
+            return self.cache[token]
+        pairs = get_pairs(word)
+
+        if not pairs:
+            return token+'</w>'
+
+        while True:
+            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                    new_word.extend(word[i:j])
+                    i = j
+                except ValueError:
+                    new_word.extend(word[i:])
+                    break
+
+                if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                    new_word.append(first+second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = ' '.join(word)
+        if word == '\n  </w>':
+            word = '\n</w>'
+        self.cache[token] = word
+        return word
+
+    def _tokenize(self, text):
+        """ Tokenize a string.
""" + split_tokens = [] + if self.fix_text is None: + # Using BERT's BasicTokenizer + text = self.nlp.tokenize(text) + for token in text: + split_tokens.extend([t for t in self.bpe(token).split(' ')]) + else: + # Using SpaCy & ftfy (original tokenization process of OpenAI GPT) + text = self.nlp(text_standardize(self.fix_text(text))) + for token in text: + split_tokens.extend([t for t in self.bpe(token.text.lower()).split(' ')]) + return split_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str/unicode) in an id using the vocab. """ + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an id in a token (BPE) using the vocab.""" + return self.decoder.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = ''.join(tokens).replace('', ' ').strip() + return out_string + + def save_vocabulary(self, save_directory): + """Save the tokenizer vocabulary and merge files to a directory.""" + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file']) + merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file']) + + with open(vocab_file, 'w', encoding='utf-8') as f: + f.write(json.dumps(self.encoder, ensure_ascii=False)) + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + writer.write(u'#version: 0.2\n') + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!".format(merge_file)) + index = token_index + writer.write(' '.join(bpe_tokens) + u'\n') + index += 1 + + return vocab_file, merge_file diff --git a/Optimus/code/pytorch_transformers/tokenization_roberta.py b/Optimus/code/pytorch_transformers/tokenization_roberta.py new file mode 100755 index 0000000000000000000000000000000000000000..67808752d5108a4784449f9fcab1ce2822266bac --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_roberta.py @@ -0,0 +1,98 @@ +# coding=utf-8 +# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for RoBERTa.""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import sys +import json +import logging +import os +import regex as re +from io import open + +from .tokenization_gpt2 import GPT2Tokenizer + +try: + from functools import lru_cache +except ImportError: + # Just a dummy decorator to get the checks to run on python2 + # because honestly I don't want to support a byte-level unicode BPE tokenizer on python 2 right now. 
+    def lru_cache():
+        return lambda func: func
+
+logger = logging.getLogger(__name__)
+
+VOCAB_FILES_NAMES = {
+    'vocab_file': 'vocab.json',
+    'merges_file': 'merges.txt',
+}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    'vocab_file':
+    {
+        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
+        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
+        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json",
+    },
+    'merges_file':
+    {
+        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
+        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
+        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt",
+    },
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    'roberta-base': 512,
+    'roberta-large': 512,
+    'roberta-large-mnli': 512,
+}
+
+
+class RobertaTokenizer(GPT2Tokenizer):
+    """
+    RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
+        - Byte-level Byte-Pair-Encoding
+        - Requires a space to start the input string => will add a space if there isn't one.
+          As a consequence, the `encode` and `decode` methods will not conserve
+          the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"`
+    """
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+
+    def __init__(self, vocab_file, merges_file, errors='replace', bos_token="<s>", eos_token="</s>", sep_token="</s>",
+                 cls_token="<s>", unk_token="<unk>", pad_token='<pad>', mask_token='<mask>', **kwargs):
+        super(RobertaTokenizer, self).__init__(vocab_file=vocab_file, merges_file=merges_file, errors=errors,
+                                               bos_token=bos_token, eos_token=eos_token, unk_token=unk_token,
+                                               sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
+                                               mask_token=mask_token, **kwargs)
+
+    def add_special_tokens_single_sentence(self, token_ids):
+        """
+        Adds special tokens to a sequence for sequence classification tasks.
+        A RoBERTa sequence has the following format: <s> X </s>
+        """
+        return [self.cls_token_id] + token_ids + [self.sep_token_id]
+
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
+        """
+        Adds special tokens to a sequence pair for sequence classification tasks.
+        A RoBERTa sequence pair has the following format: <s> A </s></s> B </s>
+        """
+        sep = [self.sep_token_id]
+        cls = [self.cls_token_id]
+        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
diff --git a/Optimus/code/pytorch_transformers/tokenization_transfo_xl.py b/Optimus/code/pytorch_transformers/tokenization_transfo_xl.py
new file mode 100755
index 0000000000000000000000000000000000000000..66bc01c1bb0196ac7ceda8b5dd51d9d735b77a75
--- /dev/null
+++ b/Optimus/code/pytorch_transformers/tokenization_transfo_xl.py
@@ -0,0 +1,575 @@
+# coding=utf-8
+# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Tokenization classes for Transformer XL model. + Adapted from https://github.com/kimiyoung/transformer-xl. +""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import glob +import logging +import os +import sys +from collections import Counter, OrderedDict +from io import open + +import torch +import numpy as np + +from .file_utils import cached_path +from .tokenization_utils import PreTrainedTokenizer + +if sys.version_info[0] == 2: + import cPickle as pickle +else: + import pickle + + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = {'pretrained_vocab_file': 'vocab.bin', 'vocab_file': 'vocab.txt'} + +PRETRAINED_VOCAB_FILES_MAP = { + 'pretrained_vocab_file': + { + 'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'transfo-xl-wt103': None, +} + +PRETRAINED_CORPUS_ARCHIVE_MAP = { + 'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin", +} +CORPUS_NAME = 'corpus.bin' + +class TransfoXLTokenizer(PreTrainedTokenizer): + """ + Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl + """ + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__(self, special=None, min_freq=0, max_size=None, lower_case=False, + delimiter=None, vocab_file=None, pretrained_vocab_file=None, + never_split=None, unk_token="", eos_token="", + additional_special_tokens=[""], **kwargs): + super(TransfoXLTokenizer, self).__init__(unk_token=unk_token, eos_token=eos_token, + additional_special_tokens=additional_special_tokens, + **kwargs) + + self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens + self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens + + if never_split is None: + never_split = self.all_special_tokens + if special is None: + special = [] + self.counter = Counter() + self.special = special + self.min_freq = min_freq + self.max_size = max_size + self.lower_case = lower_case + self.delimiter = delimiter + self.vocab_file = vocab_file + self.never_split = never_split + + if pretrained_vocab_file is not None: + # Hack because, honestly this tokenizer was not made to be used + # in a library like ours, at all. 
+ vocab_dict = torch.load(pretrained_vocab_file) + for key, value in vocab_dict.items(): + if key not in self.__dict__: + self.__dict__[key] = value + + if vocab_file is not None: + self.build_vocab() + + def count_file(self, path, verbose=False, add_eos=False): + if verbose: logger.info('counting file {} ...'.format(path)) + assert os.path.exists(path) + + sents = [] + with open(path, 'r', encoding='utf-8') as f: + for idx, line in enumerate(f): + if verbose and idx > 0 and idx % 500000 == 0: + logger.info(' line {}'.format(idx)) + symbols = self.tokenize(line, add_eos=add_eos) + self.counter.update(symbols) + sents.append(symbols) + + return sents + + def count_sents(self, sents, verbose=False): + """ + sents : a list of sentences, each a list of tokenized symbols + """ + if verbose: logger.info('counting {} sents ...'.format(len(sents))) + for idx, symbols in enumerate(sents): + if verbose and idx > 0 and idx % 500000 == 0: + logger.info(' line {}'.format(idx)) + self.counter.update(symbols) + + def _build_from_file(self, vocab_file): + self.idx2sym = [] + self.sym2idx = OrderedDict() + + with open(vocab_file, 'r', encoding='utf-8') as f: + for line in f: + symb = line.strip().split()[0] + self.add_symbol(symb) + if '' in self.sym2idx: + self.unk_idx = self.sym2idx[''] + elif '' in self.sym2idx: + self.unk_idx = self.sym2idx[''] + else: + raise ValueError('No token in vocabulary') + + def save_vocabulary(self, vocab_path): + """Save the tokenizer vocabulary to a directory or file.""" + if os.path.isdir(vocab_path): + vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES['pretrained_vocab_file']) + torch.save(self.__dict__, vocab_file) + return (vocab_file,) + + def build_vocab(self): + if self.vocab_file: + logger.info('building vocab from {}'.format(self.vocab_file)) + self._build_from_file(self.vocab_file) + logger.info('final vocab size {}'.format(len(self))) + else: + logger.info('building vocab with min_freq={}, max_size={}'.format( + self.min_freq, self.max_size)) + self.idx2sym = [] + self.sym2idx = OrderedDict() + + for sym in self.special: + self.add_special(sym) + + for sym, cnt in self.counter.most_common(self.max_size): + if cnt < self.min_freq: break + self.add_symbol(sym) + + logger.info('final vocab size {} from {} unique tokens'.format( + len(self), len(self.counter))) + + def encode_file(self, path, ordered=False, verbose=False, add_eos=True, + add_double_eos=False): + if verbose: logger.info('encoding file {} ...'.format(path)) + assert os.path.exists(path) + encoded = [] + with open(path, 'r', encoding='utf-8') as f: + for idx, line in enumerate(f): + if verbose and idx > 0 and idx % 500000 == 0: + logger.info(' line {}'.format(idx)) + symbols = self.tokenize(line, add_eos=add_eos, + add_double_eos=add_double_eos) + encoded.append(self.convert_to_tensor(symbols)) + + if ordered: + encoded = torch.cat(encoded) + + return encoded + + def encode_sents(self, sents, ordered=False, verbose=False): + if verbose: logger.info('encoding {} sents ...'.format(len(sents))) + encoded = [] + for idx, symbols in enumerate(sents): + if verbose and idx > 0 and idx % 500000 == 0: + logger.info(' line {}'.format(idx)) + encoded.append(self.convert_to_tensor(symbols)) + + if ordered: + encoded = torch.cat(encoded) + + return encoded + + def add_special(self, sym): + if sym not in self.sym2idx: + self.idx2sym.append(sym) + self.sym2idx[sym] = len(self.idx2sym) - 1 + setattr(self, '{}_idx'.format(sym.strip('<>')), self.sym2idx[sym]) + + def add_symbol(self, sym): + if sym not in 
self.sym2idx: + self.idx2sym.append(sym) + self.sym2idx[sym] = len(self.idx2sym) - 1 + + def _convert_id_to_token(self, idx): + """Converts an id in a token (BPE) using the vocab.""" + assert 0 <= idx < len(self), 'Index {} out of vocabulary range'.format(idx) + return self.idx2sym[idx] + + def _convert_token_to_id(self, sym): + """ Converts a token (str/unicode) in an id using the vocab. """ + if sym in self.sym2idx: + return self.sym2idx[sym] + else: + # logger.info('encounter unk {}'.format(sym)) + # assert '' not in sym + if hasattr(self, 'unk_idx'): + return self.sym2idx.get(sym, self.unk_idx) + # Backward compatibility with pre-trained models + elif '' in self.sym2idx: + return self.sym2idx[''] + elif '' in self.sym2idx: + return self.sym2idx[''] + else: + raise ValueError('Token not in vocabulary and no token in vocabulary for replacement') + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = ' '.join(tokens).strip() + return out_string + + def convert_to_tensor(self, symbols): + return torch.LongTensor(self.convert_tokens_to_ids(symbols)) + + @property + def vocab_size(self): + return len(self.idx2sym) + + def _tokenize(self, line, add_eos=False, add_double_eos=False): + line = line.strip() + # convert to lower case + if self.lower_case: + line = line.lower() + + # empty delimiter '' will evaluate False + if self.delimiter == '': + symbols = line + else: + symbols = line.split(self.delimiter) + + if add_double_eos: # lm1b + return [''] + symbols + [''] + elif add_eos: + return symbols + [''] + else: + return symbols + + +class LMOrderedIterator(object): + def __init__(self, data, bsz, bptt, device='cpu', ext_len=None): + """ + data -- LongTensor -- the LongTensor is strictly ordered + """ + self.bsz = bsz + self.bptt = bptt + self.ext_len = ext_len if ext_len is not None else 0 + + self.device = device + + # Work out how cleanly we can divide the dataset into bsz parts. + self.n_step = data.size(0) // bsz + + # Trim off any extra elements that wouldn't cleanly fit (remainders). + data = data.narrow(0, 0, self.n_step * bsz) + + # Evenly divide the data across the bsz batches. + self.data = data.view(bsz, -1).t().contiguous().to(device) + + # Number of mini-batches + self.n_batch = (self.n_step + self.bptt - 1) // self.bptt + + def get_batch(self, i, bptt=None): + if bptt is None: bptt = self.bptt + seq_len = min(bptt, self.data.size(0) - 1 - i) + + end_idx = i + seq_len + beg_idx = max(0, i - self.ext_len) + + data = self.data[beg_idx:end_idx] + target = self.data[i+1:i+1+seq_len] + + data_out = data.transpose(0, 1).contiguous().to(self.device) + target_out = target.transpose(0, 1).contiguous().to(self.device) + + return data_out, target_out, seq_len + + def get_fixlen_iter(self, start=0): + for i in range(start, self.data.size(0) - 1, self.bptt): + yield self.get_batch(i) + + def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3): + max_len = self.bptt + max_deviation * std + i = start + while True: + bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2. 
+ bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std)))) + data, target, seq_len = self.get_batch(i, bptt) + i += seq_len + yield data, target, seq_len + if i >= self.data.size(0) - 2: + break + + def __iter__(self): + return self.get_fixlen_iter() + + +class LMShuffledIterator(object): + def __init__(self, data, bsz, bptt, device='cpu', ext_len=None, shuffle=False): + """ + data -- list[LongTensor] -- there is no order among the LongTensors + """ + self.data = data + + self.bsz = bsz + self.bptt = bptt + self.ext_len = ext_len if ext_len is not None else 0 + + self.device = device + self.shuffle = shuffle + + def get_sent_stream(self): + # index iterator + epoch_indices = np.random.permutation(len(self.data)) if self.shuffle \ + else np.array(range(len(self.data))) + + # sentence iterator + for idx in epoch_indices: + yield self.data[idx] + + def stream_iterator(self, sent_stream): + # streams for each data in the batch + streams = [None] * self.bsz + + data = torch.LongTensor(self.bptt, self.bsz) + target = torch.LongTensor(self.bptt, self.bsz) + + n_retain = 0 + + while True: + # data : [n_retain+bptt x bsz] + # target : [bptt x bsz] + data[n_retain:].fill_(-1) + target.fill_(-1) + + valid_batch = True + + for i in range(self.bsz): + n_filled = 0 + try: + while n_filled < self.bptt: + if streams[i] is None or len(streams[i]) <= 1: + streams[i] = next(sent_stream) + # number of new tokens to fill in + n_new = min(len(streams[i]) - 1, self.bptt - n_filled) + # first n_retain tokens are retained from last batch + data[n_retain+n_filled:n_retain+n_filled+n_new, i] = \ + streams[i][:n_new] + target[n_filled:n_filled+n_new, i] = \ + streams[i][1:n_new+1] + streams[i] = streams[i][n_new:] + n_filled += n_new + except StopIteration: + valid_batch = False + break + + if not valid_batch: + return + + data_out = data.transpose(0, 1).contiguous().to(self.device) + target_out = target.transpose(0, 1).contiguous().to(self.device) + + yield data_out, target_out, self.bptt + + n_retain = min(data.size(0), self.ext_len) + if n_retain > 0: + data[:n_retain] = data[-n_retain:] + data.resize_(n_retain + self.bptt, data.size(1)) + + def __iter__(self): + # sent_stream is an iterator + sent_stream = self.get_sent_stream() + + for batch in self.stream_iterator(sent_stream): + yield batch + + +class LMMultiFileIterator(LMShuffledIterator): + def __init__(self, paths, vocab, bsz, bptt, device='cpu', ext_len=None, + shuffle=False): + + self.paths = paths + self.vocab = vocab + + self.bsz = bsz + self.bptt = bptt + self.ext_len = ext_len if ext_len is not None else 0 + + self.device = device + self.shuffle = shuffle + + def get_sent_stream(self, path): + sents = self.vocab.encode_file(path, add_double_eos=True) + if self.shuffle: + np.random.shuffle(sents) + sent_stream = iter(sents) + + return sent_stream + + def __iter__(self): + if self.shuffle: + np.random.shuffle(self.paths) + + for path in self.paths: + # sent_stream is an iterator + sent_stream = self.get_sent_stream(path) + for batch in self.stream_iterator(sent_stream): + yield batch + + +class TransfoXLCorpus(object): + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs): + """ + Instantiate a pre-processed corpus. 
+ """ + vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) + if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP: + corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path] + else: + corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME) + # redirect to the cache, if necessary + try: + resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir) + except EnvironmentError: + logger.error( + "Corpus '{}' was not found in corpus list ({}). " + "We assumed '{}' was a path or url but couldn't find files {} " + "at this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()), + pretrained_model_name_or_path, + corpus_file)) + return None + if resolved_corpus_file == corpus_file: + logger.info("loading corpus file {}".format(corpus_file)) + else: + logger.info("loading corpus file {} from cache at {}".format( + corpus_file, resolved_corpus_file)) + + # Instantiate tokenizer. + corpus = cls(*inputs, **kwargs) + corpus_dict = torch.load(resolved_corpus_file) + for key, value in corpus_dict.items(): + corpus.__dict__[key] = value + corpus.vocab = vocab + if corpus.train is not None: + corpus.train = torch.tensor(corpus.train, dtype=torch.long) + if corpus.valid is not None: + corpus.valid = torch.tensor(corpus.valid, dtype=torch.long) + if corpus.test is not None: + corpus.test = torch.tensor(corpus.test, dtype=torch.long) + return corpus + + def __init__(self, *args, **kwargs): + self.vocab = TransfoXLTokenizer(*args, **kwargs) + self.dataset = None + self.train = None + self.valid = None + self.test = None + + def build_corpus(self, path, dataset): + self.dataset = dataset + + if self.dataset in ['ptb', 'wt2', 'enwik8', 'text8']: + self.vocab.count_file(os.path.join(path, 'train.txt')) + self.vocab.count_file(os.path.join(path, 'valid.txt')) + self.vocab.count_file(os.path.join(path, 'test.txt')) + elif self.dataset == 'wt103': + self.vocab.count_file(os.path.join(path, 'train.txt')) + elif self.dataset == 'lm1b': + train_path_pattern = os.path.join( + path, '1-billion-word-language-modeling-benchmark-r13output', + 'training-monolingual.tokenized.shuffled', 'news.en-*') + train_paths = glob.glob(train_path_pattern) + # the vocab will load from file when build_vocab() is called + + self.vocab.build_vocab() + + if self.dataset in ['ptb', 'wt2', 'wt103']: + self.train = self.vocab.encode_file( + os.path.join(path, 'train.txt'), ordered=True) + self.valid = self.vocab.encode_file( + os.path.join(path, 'valid.txt'), ordered=True) + self.test = self.vocab.encode_file( + os.path.join(path, 'test.txt'), ordered=True) + elif self.dataset in ['enwik8', 'text8']: + self.train = self.vocab.encode_file( + os.path.join(path, 'train.txt'), ordered=True, add_eos=False) + self.valid = self.vocab.encode_file( + os.path.join(path, 'valid.txt'), ordered=True, add_eos=False) + self.test = self.vocab.encode_file( + os.path.join(path, 'test.txt'), ordered=True, add_eos=False) + elif self.dataset == 'lm1b': + self.train = train_paths + self.valid = self.vocab.encode_file( + os.path.join(path, 'valid.txt'), ordered=False, add_double_eos=True) + self.test = self.vocab.encode_file( + os.path.join(path, 'test.txt'), ordered=False, add_double_eos=True) + + def get_iterator(self, split, *args, **kwargs): + if split == 'train': + if self.dataset in ['ptb', 'wt2', 'wt103', 'enwik8', 'text8']: + data_iter = LMOrderedIterator(self.train, *args, **kwargs) + elif self.dataset == 'lm1b': + 
kwargs['shuffle'] = True
+                data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)
+        elif split in ['valid', 'test']:
+            data = self.valid if split == 'valid' else self.test
+            if self.dataset in ['ptb', 'wt2', 'wt103', 'enwik8', 'text8']:
+                data_iter = LMOrderedIterator(data, *args, **kwargs)
+            elif self.dataset == 'lm1b':
+                data_iter = LMShuffledIterator(data, *args, **kwargs)
+
+        return data_iter
+
+
+def get_lm_corpus(datadir, dataset):
+    fn = os.path.join(datadir, 'cache.pt')
+    fn_pickle = os.path.join(datadir, 'cache.pkl')
+    if os.path.exists(fn):
+        logger.info('Loading cached dataset...')
+        corpus = torch.load(fn)
+    elif os.path.exists(fn_pickle):
+        logger.info('Loading cached dataset from pickle...')
+        with open(fn_pickle, "rb") as fp:
+            corpus = pickle.load(fp)
+    else:
+        logger.info('Producing dataset {}...'.format(dataset))
+        kwargs = {}
+        if dataset in ['wt103', 'wt2']:
+            kwargs['special'] = ['<eos>']
+            kwargs['lower_case'] = False
+        elif dataset == 'ptb':
+            kwargs['special'] = ['<eos>']
+            kwargs['lower_case'] = True
+        elif dataset == 'lm1b':
+            kwargs['special'] = []
+            kwargs['lower_case'] = False
+            kwargs['vocab_file'] = os.path.join(datadir, '1b_word_vocab.txt')
+        elif dataset in ['enwik8', 'text8']:
+            pass
+
+        corpus = TransfoXLCorpus(datadir, dataset, **kwargs)
+        torch.save(corpus, fn)
+
+    return corpus
diff --git a/Optimus/code/pytorch_transformers/tokenization_utils.py b/Optimus/code/pytorch_transformers/tokenization_utils.py
new file mode 100755
index 0000000000000000000000000000000000000000..1e2cd59648d764d43f65073dba6c34b318dd4a6b
--- /dev/null
+++ b/Optimus/code/pytorch_transformers/tokenization_utils.py
@@ -0,0 +1,815 @@
+# coding=utf-8
+# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for OpenAI GPT."""
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import logging
+import os
+import json
+import six
+import copy
+from io import open
+
+from .file_utils import cached_path
+
+logger = logging.getLogger(__name__)
+
+SPECIAL_TOKENS_MAP_FILE = 'special_tokens_map.json'
+ADDED_TOKENS_FILE = 'added_tokens.json'
+TOKENIZER_CONFIG_FILE = 'tokenizer_config.json'
+
+class PreTrainedTokenizer(object):
+    """ Base class for all tokenizers.
+    Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.
+
+    This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).
+
+    Class attributes (overridden by derived classes):
+
+        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
+ - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the associated pretrained vocabulary file. + - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size. + - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained models, and as associated values, a dictionnary of specific arguments to pass to the ``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the ``from_pretrained()`` method. + + Parameters: + + - ``bos_token``: (`Optional`) string: a beginning of sentence token. Will be associated to ``self.bos_token`` and ``self.bos_token_id`` + + - ``eos_token``: (`Optional`) string: an end of sentence token. Will be associated to ``self.eos_token`` and ``self.eos_token_id`` + + - ``unk_token``: (`Optional`) string: an unknown token. Will be associated to ``self.unk_token`` and ``self.unk_token_id`` + + - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence). Will be associated to ``self.sep_token`` and ``self.sep_token_id`` + + - ``pad_token``: (`Optional`) string: a padding token. Will be associated to ``self.pad_token`` and ``self.pad_token_id`` + + - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model). Will be associated to ``self.cls_token`` and ``self.cls_token_id`` + + - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id`` + + - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens. Adding all special tokens here ensure they won't be split by the tokenization process. Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids`` + """ + vocab_files_names = {} + pretrained_vocab_files_map = {} + pretrained_init_configuration = {} + max_model_input_sizes = {} + + SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token", "unk_token", "sep_token", + "pad_token", "cls_token", "mask_token", + "additional_special_tokens"] + + @property + def bos_token(self): + """ Beginning of sentence token (string). Log an error if used while not having been set. """ + if self._bos_token is None: + logger.error("Using bos_token, but it is not set yet.") + return self._bos_token + + @property + def eos_token(self): + """ End of sentence token (string). Log an error if used while not having been set. """ + if self._eos_token is None: + logger.error("Using eos_token, but it is not set yet.") + return self._eos_token + + @property + def unk_token(self): + """ Unknown token (string). Log an error if used while not having been set. """ + if self._unk_token is None: + logger.error("Using unk_token, but it is not set yet.") + return self._unk_token + + @property + def sep_token(self): + """ Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. 
""" + if self._sep_token is None: + logger.error("Using sep_token, but it is not set yet.") + return self._sep_token + + @property + def pad_token(self): + """ Padding token (string). Log an error if used while not having been set. """ + if self._pad_token is None: + logger.error("Using pad_token, but it is not set yet.") + return self._pad_token + + @property + def cls_token(self): + """ Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """ + if self._cls_token is None: + logger.error("Using cls_token, but it is not set yet.") + return self._cls_token + + @property + def mask_token(self): + """ Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set. """ + if self._mask_token is None: + logger.error("Using mask_token, but it is not set yet.") + return self._mask_token + + @property + def additional_special_tokens(self): + """ All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. """ + if self._additional_special_tokens is None: + logger.error("Using additional_special_tokens, but it is not set yet.") + return self._additional_special_tokens + + @bos_token.setter + def bos_token(self, value): + self._bos_token = value + + @eos_token.setter + def eos_token(self, value): + self._eos_token = value + + @unk_token.setter + def unk_token(self, value): + self._unk_token = value + + @sep_token.setter + def sep_token(self, value): + self._sep_token = value + + @pad_token.setter + def pad_token(self, value): + self._pad_token = value + + @cls_token.setter + def cls_token(self, value): + self._cls_token = value + + @mask_token.setter + def mask_token(self, value): + self._mask_token = value + + @additional_special_tokens.setter + def additional_special_tokens(self, value): + self._additional_special_tokens = value + + @property + def bos_token_id(self): + """ Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.bos_token) + + @property + def eos_token_id(self): + """ Id of the end of sentence token in the vocabulary. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.eos_token) + + @property + def unk_token_id(self): + """ Id of the unknown token in the vocabulary. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.unk_token) + + @property + def sep_token_id(self): + """ Id of the separation token in the vocabulary. E.g. separate context and query in an input sequence. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.sep_token) + + @property + def pad_token_id(self): + """ Id of the padding token in the vocabulary. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.pad_token) + + @property + def cls_token_id(self): + """ Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.cls_token) + + @property + def mask_token_id(self): + """ Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set. 
""" + return self.convert_tokens_to_ids(self.mask_token) + + @property + def additional_special_tokens_ids(self): + """ Ids of all the additional special tokens in the vocabulary (list of integers). Log an error if used while not having been set. """ + return self.convert_tokens_to_ids(self.additional_special_tokens) + + def __init__(self, max_len=None, **kwargs): + self._bos_token = None + self._eos_token = None + self._unk_token = None + self._sep_token = None + self._pad_token = None + self._cls_token = None + self._mask_token = None + self._additional_special_tokens = [] + + self.max_len = max_len if max_len is not None else int(1e12) + + # Added tokens + self.added_tokens_encoder = {} + self.added_tokens_decoder = {} + + # inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``) + self.init_inputs = () + self.init_kwargs = {} + + for key, value in kwargs.items(): + if key in self.SPECIAL_TOKENS_ATTRIBUTES: + if key == 'additional_special_tokens': + assert isinstance(value, (list, tuple)) and all(isinstance(t, str) or (six.PY2 and isinstance(t, unicode)) for t in value) + else: + assert isinstance(value, str) or (six.PY2 and isinstance(value, unicode)) + setattr(self, key, value) + + + @classmethod + def from_pretrained(cls, *inputs, **kwargs): + r""" + Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer. + + Args: + pretrained_model_name_or_path: either: + + - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``. + - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``. + - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``. + + cache_dir: (`optional`) string: + Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used. + + force_download: (`optional`) boolean, default False: + Force to (re-)download the vocabulary files and override the cached versions if they exists. + + proxies: (`optional`) dict, default None: + A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. + The proxies are used on each request. + + inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method. + + kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details. + + Examples:: + + # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer + + # Download vocabulary from S3 and cache. + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + + # If vocabulary files are in a directory (e.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')`) + tokenizer = BertTokenizer.from_pretrained('./test/saved_model/') + + # If the tokenizer uses a single vocabulary file, you can point directly to this file + tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt') + + # You can link tokens to special vocabulary when instantiating + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='') + # You should be sure '' is in the vocabulary when doing that. + # Otherwise use tokenizer.add_special_tokens({'unk_token': ''}) instead) + assert tokenizer.unk_token == '' + + """ + return cls._from_pretrained(*inputs, **kwargs) + + + @classmethod + def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs): + cache_dir = kwargs.pop('cache_dir', None) + force_download = kwargs.pop('force_download', False) + proxies = kwargs.pop('proxies', None) + + s3_models = list(cls.max_model_input_sizes.keys()) + vocab_files = {} + init_configuration = {} + if pretrained_model_name_or_path in s3_models: + # Get the vocabulary from AWS S3 bucket + for file_id, map_list in cls.pretrained_vocab_files_map.items(): + vocab_files[file_id] = map_list[pretrained_model_name_or_path] + if cls.pretrained_init_configuration and pretrained_model_name_or_path in cls.pretrained_init_configuration: + init_configuration = cls.pretrained_init_configuration[pretrained_model_name_or_path] + else: + # Get the vocabulary from local files + logger.info( + "Model name '{}' not found in model shortcut name list ({}). " + "Assuming '{}' is a path or url to a directory containing tokenizer files.".format( + pretrained_model_name_or_path, ', '.join(s3_models), + pretrained_model_name_or_path)) + + # Look for the tokenizer main vocabulary files + for file_id, file_name in cls.vocab_files_names.items(): + if os.path.isdir(pretrained_model_name_or_path): + # If a directory is provided we look for the standard filenames + full_file_name = os.path.join(pretrained_model_name_or_path, file_name) + else: + # If a path to a file is provided we use it (will only work for non-BPE tokenizer using a single vocabulary file) + full_file_name = pretrained_model_name_or_path + if not os.path.exists(full_file_name): + logger.info("Didn't find file {}. We won't load it.".format(full_file_name)) + full_file_name = None + vocab_files[file_id] = full_file_name + + # Look for the additional tokens files + additional_files_names = {'added_tokens_file': ADDED_TOKENS_FILE, + 'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE, + 'tokenizer_config_file': TOKENIZER_CONFIG_FILE, + } + + # If a path to a file was provided, get the parent directory + saved_directory = pretrained_model_name_or_path + if os.path.exists(saved_directory) and not os.path.isdir(saved_directory): + saved_directory = os.path.dirname(saved_directory) + + for file_id, file_name in additional_files_names.items(): + full_file_name = os.path.join(saved_directory, file_name) + if not os.path.exists(full_file_name): + logger.info("Didn't find file {}. We won't load it.".format(full_file_name)) + full_file_name = None + vocab_files[file_id] = full_file_name + + if all(full_file_name is None for full_file_name in vocab_files.values()): + logger.error( + "Model name '{}' was not found in model name list ({}). 
" + "We assumed '{}' was a path or url but couldn't find tokenizer files" + "at this path or url.".format( + pretrained_model_name_or_path, ', '.join(s3_models), + pretrained_model_name_or_path, )) + return None + + # Get files from url, cache, or disk depending on the case + try: + resolved_vocab_files = {} + for file_id, file_path in vocab_files.items(): + if file_path is None: + resolved_vocab_files[file_id] = None + else: + resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies) + except EnvironmentError as e: + if pretrained_model_name_or_path in s3_models: + logger.error("Couldn't reach server to download vocabulary.") + else: + logger.error( + "Model name '{}' was not found in model name list ({}). " + "We assumed '{}' was a path or url but couldn't find files {} " + "at this path or url.".format( + pretrained_model_name_or_path, ', '.join(s3_models), + pretrained_model_name_or_path, str(vocab_files.keys()))) + raise e + + for file_id, file_path in vocab_files.items(): + if file_path == resolved_vocab_files[file_id]: + logger.info("loading file {}".format(file_path)) + else: + logger.info("loading file {} from cache at {}".format( + file_path, resolved_vocab_files[file_id])) + + # Prepare tokenizer initialization kwargs + # Did we saved some inputs and kwargs to reload ? + tokenizer_config_file = resolved_vocab_files.pop('tokenizer_config_file', None) + if tokenizer_config_file is not None: + init_kwargs = json.load(open(tokenizer_config_file, encoding="utf-8")) + saved_init_inputs = init_kwargs.pop('init_inputs', ()) + if not init_inputs: + init_inputs = saved_init_inputs + else: + init_kwargs = init_configuration + + # Update with newly provided kwargs + init_kwargs.update(kwargs) + + # Set max length if needed + if pretrained_model_name_or_path in cls.max_model_input_sizes: + # if we're using a pretrained model, ensure the tokenizer + # wont index sequences longer than the number of positional embeddings + max_len = cls.max_model_input_sizes[pretrained_model_name_or_path] + if max_len is not None and isinstance(max_len, (int, float)): + init_kwargs['max_len'] = min(init_kwargs.get('max_len', int(1e12)), max_len) + + # Merge resolved_vocab_files arguments in init_kwargs. + added_tokens_file = resolved_vocab_files.pop('added_tokens_file', None) + special_tokens_map_file = resolved_vocab_files.pop('special_tokens_map_file', None) + for args_name, file_path in resolved_vocab_files.items(): + if args_name not in init_kwargs: + init_kwargs[args_name] = file_path + if special_tokens_map_file is not None: + special_tokens_map = json.load(open(special_tokens_map_file, encoding="utf-8")) + for key, value in special_tokens_map.items(): + if key not in init_kwargs: + init_kwargs[key] = value + + # Instantiate tokenizer. + tokenizer = cls(*init_inputs, **init_kwargs) + + # Save inputs and kwargs for saving and re-loading with ``save_pretrained`` + tokenizer.init_inputs = init_inputs + tokenizer.init_kwargs = init_kwargs + + # Add supplementary tokens. 
+ if added_tokens_file is not None: + added_tok_encoder = json.load(open(added_tokens_file, encoding="utf-8")) + added_tok_decoder = {v:k for k, v in added_tok_encoder.items()} + tokenizer.added_tokens_encoder.update(added_tok_encoder) + tokenizer.added_tokens_decoder.update(added_tok_decoder) + + return tokenizer + + + def save_pretrained(self, save_directory): + """ Save the tokenizer vocabulary files together with: + - added tokens, + - special-tokens-to-class-attributes-mapping, + - tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert). + + This won't save modifications other than (added tokens and special token mapping) you may have + applied to the tokenizer after the instantion (e.g. modifying tokenizer.do_lower_case after creation). + + This method make sure the full tokenizer can then be re-loaded using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method. + """ + if not os.path.isdir(save_directory): + logger.error("Saving directory ({}) should be a directory".format(save_directory)) + return + + special_tokens_map_file = os.path.join(save_directory, SPECIAL_TOKENS_MAP_FILE) + added_tokens_file = os.path.join(save_directory, ADDED_TOKENS_FILE) + tokenizer_config_file = os.path.join(save_directory, TOKENIZER_CONFIG_FILE) + + tokenizer_config = copy.deepcopy(self.init_kwargs) + tokenizer_config['init_inputs'] = copy.deepcopy(self.init_inputs) + for file_id in self.vocab_files_names.keys(): + tokenizer_config.pop(file_id, None) + + with open(tokenizer_config_file, 'w', encoding='utf-8') as f: + f.write(json.dumps(tokenizer_config, ensure_ascii=False)) + + with open(special_tokens_map_file, 'w', encoding='utf-8') as f: + f.write(json.dumps(self.special_tokens_map, ensure_ascii=False)) + + with open(added_tokens_file, 'w', encoding='utf-8') as f: + if self.added_tokens_encoder: + out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False) + else: + out_str = u"{}" + f.write(out_str) + + vocab_files = self.save_vocabulary(save_directory) + + return vocab_files + (special_tokens_map_file, added_tokens_file) + + + def save_vocabulary(self, save_directory): + """ Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens + and special token mappings. + + Please use :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` `()` to save the full Tokenizer state if you want to reload it using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method. + """ + raise NotImplementedError + + + def vocab_size(self): + """ Size of the base vocabulary (without the added tokens) """ + raise NotImplementedError + + + def __len__(self): + """ Size of the full vocabulary with the added tokens """ + return self.vocab_size + len(self.added_tokens_encoder) + + + def add_tokens(self, new_tokens): + """ + Add a list of new tokens to the tokenizer class. If the new tokens are not in the + vocabulary, they are added to it with indices starting from length of the current vocabulary. + + Args: + new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them). + + Returns: + Number of tokens added to the vocabulary. 
+ + Examples:: + + # Let's see how to increase the vocabulary of Bert model and tokenizer + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + model = BertModel.from_pretrained('bert-base-uncased') + + num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2']) + print('We have added', num_added_toks, 'tokens') + model.resize_token_embeddings(len(tokenizer)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. + """ + if not new_tokens: + return 0 + + to_add_tokens = [] + for token in new_tokens: + assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode)) + if token != self.unk_token and \ + self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token): + to_add_tokens.append(token) + logger.info("Adding %s to the vocabulary", token) + + added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(to_add_tokens)) + added_tok_decoder = {v:k for k, v in added_tok_encoder.items()} + self.added_tokens_encoder.update(added_tok_encoder) + self.added_tokens_decoder.update(added_tok_decoder) + + return len(to_add_tokens) + + + def add_special_tokens(self, special_tokens_dict): + """ + Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them + to class attributes. If special tokens are NOT in the vocabulary, they are added + to it (indexed starting from the last index of the current vocabulary). + + Using `add_special_tokens` will ensure your special tokens can be used in several ways: + + - special tokens are carefully handled by the tokenizer (they are never split) + - you can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This makes it easy to develop model-agnostic training and fine-tuning scripts. + + When possible, special tokens are already registered for provided pretrained models (ex: BertTokenizer cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '') + + Args: + special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: + [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, + ``additional_special_tokens``]. + + Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them). + + Returns: + Number of tokens added to the vocabulary. + + Examples:: + + # Let's see how to add a new classification token to GPT-2 + tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + model = GPT2Model.from_pretrained('gpt2') + + special_tokens_dict = {'cls_token': ''} + + num_added_toks = tokenizer.add_special_tokens(special_tokens_dict) + print('We have added', num_added_toks, 'tokens') + model.resize_token_embeddings(len(tokenizer)) # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer. 
+ + assert tokenizer.cls_token == '' + """ + if not special_tokens_dict: + return 0 + + added_tokens = 0 + for key, value in special_tokens_dict.items(): + assert key in self.SPECIAL_TOKENS_ATTRIBUTES + if key == 'additional_special_tokens': + assert isinstance(value, (list, tuple)) and all(isinstance(t, str) or (six.PY2 and isinstance(t, unicode)) for t in value) + added_tokens += self.add_tokens(value) + else: + assert isinstance(value, str) or (six.PY2 and isinstance(value, unicode)) + added_tokens += self.add_tokens([value]) + logger.info("Assigning %s to the %s key of the tokenizer", value, key) + setattr(self, key, value) + + return added_tokens + + def tokenize(self, text, **kwargs): + """ Converts a string in a sequence of tokens (string), using the tokenizer. + Split in words for word-based vocabulary or sub-words for sub-word-based + vocabularies (BPE/SentencePieces/WordPieces). + + Take care of added tokens. + """ + def split_on_token(tok, text): + result = [] + split_text = text.split(tok) + for i, sub_text in enumerate(split_text): + sub_text = sub_text.strip() + if i == 0 and not sub_text: + result += [tok] + elif i == len(split_text) - 1: + if sub_text: + result += [sub_text] + else: + pass + else: + if sub_text: + result += [sub_text] + result += [tok] + return result + + def split_on_tokens(tok_list, text): + if not text: + return [] + if not tok_list: + return self._tokenize(text, **kwargs) + + tokenized_text = [] + text_list = [text] + for tok in tok_list: + tokenized_text = [] + for sub_text in text_list: + if sub_text not in self.added_tokens_encoder \ + and sub_text not in self.all_special_tokens: + tokenized_text += split_on_token(tok, sub_text) + else: + tokenized_text += [sub_text] + text_list = tokenized_text + + return sum((self._tokenize(token, **kwargs) if token not \ + in self.added_tokens_encoder and token not in self.all_special_tokens \ + else [token] for token in tokenized_text), []) + + added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens + tokenized_text = split_on_tokens(added_tokens, text) + return tokenized_text + + def _tokenize(self, text, **kwargs): + """ Converts a string in a sequence of tokens (string), using the tokenizer. + Split in words for word-based vocabulary or sub-words for sub-word-based + vocabularies (BPE/SentencePieces/WordPieces). + + Do NOT take care of added tokens. + """ + raise NotImplementedError + + def convert_tokens_to_ids(self, tokens): + """ Converts a single token, or a sequence of tokens, (str/unicode) in a single integer id + (resp. a sequence of ids), using the vocabulary. + """ + if tokens is None: + return None + + if isinstance(tokens, str) or (six.PY2 and isinstance(tokens, unicode)): + return self._convert_token_to_id_with_added_voc(tokens) + + ids = [] + for token in tokens: + ids.append(self._convert_token_to_id_with_added_voc(token)) + if len(ids) > self.max_len: + logger.warning("Token indices sequence length is longer than the specified maximum sequence length " + "for this model ({} > {}). 
Running this sequence through the model will result in " + "indexing errors".format(len(ids), self.max_len)) + return ids + + def _convert_token_to_id_with_added_voc(self, token): + if token is None: + return None + + if token in self.added_tokens_encoder: + return self.added_tokens_encoder[token] + return self._convert_token_to_id(token) + + def _convert_token_to_id(self, token): + raise NotImplementedError + + def encode(self, text, text_pair=None, add_special_tokens=False, **kwargs): + """ + Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary. + + Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``. + + Args: + text: The first sequence to be encoded. + text_pair: Optional second sequence to be encoded. + add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative + to their model. + **kwargs: passed to the `self.tokenize()` method + """ + if text_pair is None: + if add_special_tokens: + return self.add_special_tokens_single_sentence(self.convert_tokens_to_ids(self.tokenize(text, **kwargs))) + else: + return self.convert_tokens_to_ids(self.tokenize(text, **kwargs)) + + first_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text, **kwargs)] + second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text_pair, **kwargs)] + + if add_special_tokens: + return self.add_special_tokens_sentences_pair(first_sentence_tokens, second_sentence_tokens) + else: + return first_sentence_tokens, second_sentence_tokens + + def add_special_tokens_single_sentence(self, token_ids): + logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.") + return token_ids + + def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1): + logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.") + return token_ids_0 + token_ids_1 + + def convert_ids_to_tokens(self, ids, skip_special_tokens=False): + """ Converts a single index or a sequence of indices (integers) in a token " + (resp.) a sequence of tokens (str/unicode), using the vocabulary and added tokens. + + Args: + skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False + """ + if isinstance(ids, int): + if ids in self.added_tokens_decoder: + return self.added_tokens_decoder[ids] + else: + return self._convert_id_to_token(ids) + tokens = [] + for index in ids: + if skip_special_tokens and index in self.all_special_ids: + continue + if index in self.added_tokens_decoder: + tokens.append(self.added_tokens_decoder[index]) + else: + tokens.append(self._convert_id_to_token(index)) + return tokens + + def _convert_id_to_token(self, index): + raise NotImplementedError + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. + The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids)) + but we often want to remove sub-word tokenization artifacts at the same time. + """ + return ' '.join(self.convert_ids_to_tokens(tokens)) + + def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True): + """ + Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary + with options to remove special tokens and clean up tokenization spaces. + Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``. 
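A short round-trip sketch may help tie `encode` and `decode` together (illustrative; the exact string returned depends on the tokenizer's `convert_tokens_to_string` and on the clean-up rules below):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

ids = tokenizer.encode("Hello, world!")                         # tokenize + convert_tokens_to_ids
text = tokenizer.decode(ids, clean_up_tokenization_spaces=True)
print(ids)    # list of vocabulary indices
print(text)   # e.g. "hello, world!" for an uncased vocabulary
```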
+ """ + filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens) + + # To avoid mixing byte-level and unicode for byte-level BPT + # we need to build string separatly for added tokens and byte-level tokens + # cf. https://github.com/huggingface/pytorch-transformers/issues/1133 + sub_texts = [] + current_sub_text = [] + for token in filtered_tokens: + if skip_special_tokens and token in self.all_special_ids: + continue + if token in self.added_tokens_encoder: + if current_sub_text: + sub_texts.append(self.convert_tokens_to_string(current_sub_text)) + current_sub_text = [] + sub_texts.append(" " + token) + else: + current_sub_text.append(token) + if current_sub_text: + sub_texts.append(self.convert_tokens_to_string(current_sub_text)) + text = ''.join(sub_texts) + + if self._sep_token is not None and self._sep_token in text: + text = text.replace(self._cls_token, self._sep_token) + split_text = list(filter(lambda sentence: len(sentence) > 0, text.split(self._sep_token))) + if clean_up_tokenization_spaces: + clean_text = [self.clean_up_tokenization(text) for text in split_text] + return clean_text + else: + return split_text + else: + if clean_up_tokenization_spaces: + clean_text = self.clean_up_tokenization(text) + return clean_text + else: + return text + + @property + def special_tokens_map(self): + """ A dictionary mapping special token class attribute (cls_token, unk_token...) to their + values ('', ''...) + """ + set_attr = {} + for attr in self.SPECIAL_TOKENS_ATTRIBUTES: + attr_value = getattr(self, "_" + attr) + if attr_value: + set_attr[attr] = attr_value + return set_attr + + @property + def all_special_tokens(self): + """ List all the special tokens ('', ''...) mapped to class attributes + (cls_token, unk_token...). + """ + all_toks = [] + set_attr = self.special_tokens_map + for attr_value in set_attr.values(): + all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value]) + all_toks = list(set(all_toks)) + return all_toks + + @property + def all_special_ids(self): + """ List the vocabulary indices of the special tokens ('', ''...) mapped to + class attributes (cls_token, unk_token...). + """ + all_toks = self.all_special_tokens + all_ids = list(self._convert_token_to_id(t) for t in all_toks) + return all_ids + + @staticmethod + def clean_up_tokenization(out_string): + """ Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms. + """ + out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ',' + ).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't" + ).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re") + return out_string diff --git a/Optimus/code/pytorch_transformers/tokenization_xlm.py b/Optimus/code/pytorch_transformers/tokenization_xlm.py new file mode 100755 index 0000000000000000000000000000000000000000..f7231384b31af5980487157a907ab2c53e33945b --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_xlm.py @@ -0,0 +1,794 @@ +# coding=utf-8 +# Copyright 2019 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for OpenAI GPT.""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import json +import logging +import os +import re +import sys +import unicodedata +from io import open + +import sacremoses as sm + +from .tokenization_utils import PreTrainedTokenizer +from .tokenization_bert import BasicTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = { + 'vocab_file': 'vocab.json', + 'merges_file': 'merges.txt', +} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'xlm-mlm-en-2048': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-vocab.json", + 'xlm-mlm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-vocab.json", + 'xlm-mlm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-vocab.json", + 'xlm-mlm-enro-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-vocab.json", + 'xlm-mlm-tlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-vocab.json", + 'xlm-mlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-vocab.json", + 'xlm-clm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-vocab.json", + 'xlm-clm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-vocab.json", + 'xlm-mlm-17-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-vocab.json", + 'xlm-mlm-100-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-vocab.json", + }, + 'merges_file': + { + 'xlm-mlm-en-2048': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-merges.txt", + 'xlm-mlm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt", + 'xlm-mlm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt", + 'xlm-mlm-enro-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-merges.txt", + 'xlm-mlm-tlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-merges.txt", + 'xlm-mlm-xnli15-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-merges.txt", + 'xlm-clm-enfr-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt", + 'xlm-clm-ende-1024': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt", + 'xlm-mlm-17-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-merges.txt", + 'xlm-mlm-100-1280': "https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'xlm-mlm-en-2048': 512, + 'xlm-mlm-ende-1024': 512, + 'xlm-mlm-enfr-1024': 512, + 'xlm-mlm-enro-1024': 512, + 'xlm-mlm-tlm-xnli15-1024': 512, + 'xlm-mlm-xnli15-1024': 512, + 'xlm-clm-enfr-1024': 512, + 'xlm-clm-ende-1024': 512, + 'xlm-mlm-17-1280': 512, + 'xlm-mlm-100-1280': 512, +} + +PRETRAINED_INIT_CONFIGURATION = { + 'xlm-mlm-en-2048': 
{"do_lowercase_and_remove_accent": True}, + 'xlm-mlm-ende-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "de", + "1": "en"}, + "lang2id": { "de": 0, + "en": 1 }}, + 'xlm-mlm-enfr-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "en", + "1": "fr"}, + "lang2id": { "en": 0, + "fr": 1 }}, + 'xlm-mlm-enro-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "en", + "1": "ro"}, + "lang2id": { "en": 0, + "ro": 1 }}, + 'xlm-mlm-tlm-xnli15-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "ar", + "1": "bg", + "2": "de", + "3": "el", + "4": "en", + "5": "es", + "6": "fr", + "7": "hi", + "8": "ru", + "9": "sw", + "10": "th", + "11": "tr", + "12": "ur", + "13": "vi", + "14": "zh"}, + "lang2id": { "ar": 0, + "bg": 1, + "de": 2, + "el": 3, + "en": 4, + "es": 5, + "fr": 6, + "hi": 7, + "ru": 8, + "sw": 9, + "th": 10, + "tr": 11, + "ur": 12, + "vi": 13, + "zh": 14 }}, + 'xlm-mlm-xnli15-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "ar", + "1": "bg", + "2": "de", + "3": "el", + "4": "en", + "5": "es", + "6": "fr", + "7": "hi", + "8": "ru", + "9": "sw", + "10": "th", + "11": "tr", + "12": "ur", + "13": "vi", + "14": "zh"}, + "lang2id": { "ar": 0, + "bg": 1, + "de": 2, + "el": 3, + "en": 4, + "es": 5, + "fr": 6, + "hi": 7, + "ru": 8, + "sw": 9, + "th": 10, + "tr": 11, + "ur": 12, + "vi": 13, + "zh": 14 }}, + 'xlm-clm-enfr-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "en", + "1": "fr"}, + "lang2id": { "en": 0, + "fr": 1 }}, + 'xlm-clm-ende-1024': { "do_lowercase_and_remove_accent": True, + "id2lang": { "0": "de", + "1": "en"}, + "lang2id": { "de": 0, + "en": 1 }}, + 'xlm-mlm-17-1280': {"do_lowercase_and_remove_accent": False, + "id2lang": { + "0": "ar", + "1": "de", + "2": "en", + "3": "es", + "4": "fr", + "5": "hi", + "6": "it", + "7": "ja", + "8": "ko", + "9": "nl", + "10": "pl", + "11": "pt", + "12": "ru", + "13": "sv", + "14": "tr", + "15": "vi", + "16": "zh" + }, + "lang2id": { + "ar": 0, + "de": 1, + "en": 2, + "es": 3, + "fr": 4, + "hi": 5, + "it": 6, + "ja": 7, + "ko": 8, + "nl": 9, + "pl": 10, + "pt": 11, + "ru": 12, + "sv": 13, + "tr": 14, + "vi": 15, + "zh": 16}}, + 'xlm-mlm-100-1280': {"do_lowercase_and_remove_accent": False, + "id2lang": { + "0": "af", + "1": "als", + "2": "am", + "3": "an", + "4": "ang", + "5": "ar", + "6": "arz", + "7": "ast", + "8": "az", + "9": "bar", + "10": "be", + "11": "bg", + "12": "bn", + "13": "br", + "14": "bs", + "15": "ca", + "16": "ceb", + "17": "ckb", + "18": "cs", + "19": "cy", + "20": "da", + "21": "de", + "22": "el", + "23": "en", + "24": "eo", + "25": "es", + "26": "et", + "27": "eu", + "28": "fa", + "29": "fi", + "30": "fr", + "31": "fy", + "32": "ga", + "33": "gan", + "34": "gl", + "35": "gu", + "36": "he", + "37": "hi", + "38": "hr", + "39": "hu", + "40": "hy", + "41": "ia", + "42": "id", + "43": "is", + "44": "it", + "45": "ja", + "46": "jv", + "47": "ka", + "48": "kk", + "49": "kn", + "50": "ko", + "51": "ku", + "52": "la", + "53": "lb", + "54": "lt", + "55": "lv", + "56": "mk", + "57": "ml", + "58": "mn", + "59": "mr", + "60": "ms", + "61": "my", + "62": "nds", + "63": "ne", + "64": "nl", + "65": "nn", + "66": "no", + "67": "oc", + "68": "pl", + "69": "pt", + "70": "ro", + "71": "ru", + "72": "scn", + "73": "sco", + "74": "sh", + "75": "si", + "76": "simple", + "77": "sk", + "78": "sl", + "79": "sq", + "80": "sr", + "81": "sv", + "82": "sw", + "83": "ta", + "84": "te", + "85": "th", + "86": "tl", + "87": "tr", + "88": "tt", + "89": 
"uk", + "90": "ur", + "91": "uz", + "92": "vi", + "93": "war", + "94": "wuu", + "95": "yi", + "96": "zh", + "97": "zh_classical", + "98": "zh_min_nan", + "99": "zh_yue" + }, + "lang2id": { + "af": 0, + "als": 1, + "am": 2, + "an": 3, + "ang": 4, + "ar": 5, + "arz": 6, + "ast": 7, + "az": 8, + "bar": 9, + "be": 10, + "bg": 11, + "bn": 12, + "br": 13, + "bs": 14, + "ca": 15, + "ceb": 16, + "ckb": 17, + "cs": 18, + "cy": 19, + "da": 20, + "de": 21, + "el": 22, + "en": 23, + "eo": 24, + "es": 25, + "et": 26, + "eu": 27, + "fa": 28, + "fi": 29, + "fr": 30, + "fy": 31, + "ga": 32, + "gan": 33, + "gl": 34, + "gu": 35, + "he": 36, + "hi": 37, + "hr": 38, + "hu": 39, + "hy": 40, + "ia": 41, + "id": 42, + "is": 43, + "it": 44, + "ja": 45, + "jv": 46, + "ka": 47, + "kk": 48, + "kn": 49, + "ko": 50, + "ku": 51, + "la": 52, + "lb": 53, + "lt": 54, + "lv": 55, + "mk": 56, + "ml": 57, + "mn": 58, + "mr": 59, + "ms": 60, + "my": 61, + "nds": 62, + "ne": 63, + "nl": 64, + "nn": 65, + "no": 66, + "oc": 67, + "pl": 68, + "pt": 69, + "ro": 70, + "ru": 71, + "scn": 72, + "sco": 73, + "sh": 74, + "si": 75, + "simple": 76, + "sk": 77, + "sl": 78, + "sq": 79, + "sr": 80, + "sv": 81, + "sw": 82, + "ta": 83, + "te": 84, + "th": 85, + "tl": 86, + "tr": 87, + "tt": 88, + "uk": 89, + "ur": 90, + "uz": 91, + "vi": 92, + "war": 93, + "wuu": 94, + "yi": 95, + "zh": 96, + "zh_classical": 97, + "zh_min_nan": 98, + "zh_yue": 99 + }}, +} + +def get_pairs(word): + """ + Return set of symbol pairs in a word. + word is represented as tuple of symbols (symbols being variable-length strings) + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + + +def lowercase_and_remove_accent(text): + """ + Lowercase and strips accents from a piece of text based on + https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py + """ + text = ' '.join(text) + text = text.lower() + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output).lower().split(' ') + + +def replace_unicode_punct(text): + ''' + Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/replace-unicode-punctuation.perl + ''' + text = text.replace(',', ',') + text = re.sub(r'。\s*', '. ', text) + text = text.replace('、', ',') + text = text.replace('”', '"') + text = text.replace('“', '"') + text = text.replace('∶', ':') + text = text.replace(':', ':') + text = text.replace('?', '?') + text = text.replace('《', '"') + text = text.replace('》', '"') + text = text.replace(')', ')') + text = text.replace('!', '!') + text = text.replace('(', '(') + text = text.replace(';', ';') + text = text.replace('1', '"') + text = text.replace('」', '"') + text = text.replace('「', '"') + text = text.replace('0', '0') + text = text.replace('3', '3') + text = text.replace('2', '2') + text = text.replace('5', '5') + text = text.replace('6', '6') + text = text.replace('9', '9') + text = text.replace('7', '7') + text = text.replace('8', '8') + text = text.replace('4', '4') + text = re.sub(r'.\s*', '. 
', text) + text = text.replace('~', '~') + text = text.replace('’', '\'') + text = text.replace('…', '...') + text = text.replace('━', '-') + text = text.replace('〈', '<') + text = text.replace('〉', '>') + text = text.replace('【', '[') + text = text.replace('】', ']') + text = text.replace('%', '%') + return text + + +def remove_non_printing_char(text): + ''' + Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/remove-non-printing-char.perl + ''' + output = [] + for char in text: + cat = unicodedata.category(char) + if cat.startswith('C'): + continue + output.append(char) + return "".join(output) + + +def romanian_preprocessing(text): + '''Sennrich's WMT16 scripts for Romanian preprocessing, used by model `xlm-mlm-enro-1024`''' + # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/normalise-romanian.py + text = text.replace("\u015e", "\u0218").replace("\u015f", "\u0219") + text = text.replace("\u0162", "\u021a").replace("\u0163", "\u021b") + # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/remove-diacritics.py + text = text.replace("\u0218", "S").replace("\u0219", "s") #s-comma + text = text.replace("\u021a", "T").replace("\u021b", "t") #t-comma + text = text.replace("\u0102", "A").replace("\u0103", "a") + text = text.replace("\u00C2", "A").replace("\u00E2", "a") + text = text.replace("\u00CE", "I").replace("\u00EE", "i") + return text + + +class XLMTokenizer(PreTrainedTokenizer): + """ + BPE tokenizer for XLM + + - Moses preprocessing & tokenization for most supported languages + + - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP) + + - (optionally) lower case & normalize all inputs text + + - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add additional symbols \ + (ex: "__classify__") to a vocabulary + + - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies) + + - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies) + + - `do_lowercase_and_remove_accent` controle lower casing and accent (automatically set for pretrained vocabularies) + """ + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__(self, vocab_file, merges_file, unk_token="", bos_token="", + sep_token="", pad_token="", cls_token="", + mask_token="", additional_special_tokens=["", + "", "", "", "", "", + "", "", "", ""], + lang2id=None, id2lang=None, do_lowercase_and_remove_accent=True, + **kwargs): + super(XLMTokenizer, self).__init__(unk_token=unk_token, bos_token=bos_token, + sep_token=sep_token, pad_token=pad_token, + cls_token=cls_token, mask_token=mask_token, + additional_special_tokens=additional_special_tokens, + **kwargs) + + # cache of sm.MosesPunctNormalizer instance + self.cache_moses_punct_normalizer = dict() + # cache of sm.MosesTokenizer instance + self.cache_moses_tokenizer = dict() + self.lang_with_custom_tokenizer = set(['zh', 'th', 'ja']) + # True for current supported model (v1.2.0), False for XLM-17 & 100 + self.do_lowercase_and_remove_accent = do_lowercase_and_remove_accent + self.lang2id = lang2id + self.id2lang = id2lang + if lang2id is not None and id2lang is not None: + assert len(lang2id) == len(id2lang) + + self.ja_word_tokenizer = None + 
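Because `XLMTokenizer` routes each language through a different pre-tokenizer before applying BPE, a brief usage sketch may be helpful at this point (it assumes the pretrained XLM vocabulary/merges files can be downloaded and that `sacremoses` is installed; the sentence is just an example):

```python
from pytorch_transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-enfr-1024')

# `lang` selects the Moses pipeline (plus lowercasing/accent removal for this checkpoint)
tokens = tokenizer.tokenize("Ceci est une phrase d'exemple.", lang='fr')
ids = tokenizer.convert_tokens_to_ids(tokens)
```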
self.zh_word_tokenizer = None + + self.encoder = json.load(open(vocab_file, encoding="utf-8")) + self.decoder = {v:k for k,v in self.encoder.items()} + merges = open(merges_file, encoding='utf-8').read().split('\n')[:-1] + merges = [tuple(merge.split()[:2]) for merge in merges] + self.bpe_ranks = dict(zip(merges, range(len(merges)))) + self.cache = {} + + def moses_punct_norm(self, text, lang): + if lang not in self.cache_moses_punct_normalizer: + punct_normalizer = sm.MosesPunctNormalizer(lang=lang) + self.cache_moses_punct_normalizer[lang] = punct_normalizer + else: + punct_normalizer = self.cache_moses_punct_normalizer[lang] + return punct_normalizer.normalize(text) + + def moses_tokenize(self, text, lang): + if lang not in self.cache_moses_tokenizer: + moses_tokenizer = sm.MosesTokenizer(lang=lang) + self.cache_moses_tokenizer[lang] = moses_tokenizer + else: + moses_tokenizer = self.cache_moses_tokenizer[lang] + return moses_tokenizer.tokenize(text, return_str=False, escape=False) + + def moses_pipeline(self, text, lang): + text = replace_unicode_punct(text) + text = self.moses_punct_norm(text, lang) + text = remove_non_printing_char(text) + return text + + def ja_tokenize(self, text): + if self.ja_word_tokenizer is None: + try: + import Mykytea + self.ja_word_tokenizer = Mykytea.Mykytea('-model %s/local/share/kytea/model.bin' % os.path.expanduser('~')) + except (AttributeError, ImportError) as e: + logger.error("Make sure you install KyTea (https://github.com/neubig/kytea) and it's python wrapper (https://github.com/chezou/Mykytea-python) with the following steps") + logger.error("1. git clone git@github.com:neubig/kytea.git && cd kytea") + logger.error("2. autoreconf -i") + logger.error("3. ./configure --prefix=$HOME/local") + logger.error("4. make && make install") + logger.error("5. pip install kytea") + raise e + return list(self.ja_word_tokenizer.getWS(text)) + + @property + def vocab_size(self): + return len(self.encoder) + + def bpe(self, token): + word = tuple(token[:-1]) + (token[-1] + '',) + if token in self.cache: + return self.cache[token] + pairs = get_pairs(word) + + if not pairs: + return token+'' + + while True: + bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf'))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + new_word.extend(word[i:j]) + i = j + except: + new_word.extend(word[i:]) + break + + if word[i] == first and i < len(word)-1 and word[i+1] == second: + new_word.append(first+second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = ' '.join(word) + if word == '\n ': + word = '\n' + self.cache[token] = word + return word + + def _tokenize(self, text, lang='en', bypass_tokenizer=False): + """ + Tokenize a string given language code. For Chinese, Japanese and Thai, we use a language specific tokenizerself. Otherwise, we use Moses. 
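Before the language-specific details below, here is a standalone toy illustration of the greedy merge loop that `bpe()` above implements; the merge table is made up for the example and the end-of-word marker is omitted for clarity:

```python
def toy_bpe(token, bpe_ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest (earliest-learned) rank."""
    word = list(token)
    while len(word) > 1:
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float('inf')))
        if best not in bpe_ranks:      # no learnable merge left
            break
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == (first, second):
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word

ranks = {('l', 'o'): 0, ('lo', 'w'): 1}   # hypothetical merges, not a real merges.txt
print(toy_bpe('lower', ranks))            # -> ['low', 'e', 'r']
```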
+ + Details of tokenization: + - [sacremoses](https://github.com/alvations/sacremoses): port of Moses + - Install with `pip install sacremoses` + - [pythainlp](https://github.com/PyThaiNLP/pythainlp): Thai tokenizer + - Install with `pip install pythainlp` + - [kytea](https://github.com/chezou/Mykytea-python): Japanese tokenizer, wrapper of [KyTea](https://github.com/neubig/kytea) + - Install with the following steps: + ``` + git clone git@github.com:neubig/kytea.git && cd kytea + autoreconf -i + ./configure --prefix=$HOME/local + make && make install + pip install kytea + ``` + - [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer * + - Install with `pip install jieba` + + \* The original XLM used [Stanford Segmenter](https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip). + However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead, and it will be deprecated. + Jieba is a lot faster and pip-installable. Note there is some mismatch with the Stanford Segmenter. It should be fine + if you fine-tune the model with Chinese supervisionself. If you want the same exact behaviour, use the original XLM + [preprocessing script](https://github.com/facebookresearch/XLM/tree/master/tools) to tokenize the sentence externally, + and set `bypass_tokenizer=True` to bypass the tokenizer. + + Args: + - lang: ISO language code (default = 'en') (string). Languages should belong of the model supported languages. However, we don't enforce it. + - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False) (bool). If True, we only apply BPE. + + Returns: + List of tokens. + """ + if lang and self.lang2id and lang not in self.lang2id: + logger.error("Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.") + if bypass_tokenizer: + text = text.split() + elif lang not in self.lang_with_custom_tokenizer: + text = self.moses_pipeline(text, lang=lang) + # TODO: make sure we are using `xlm-mlm-enro-1024`, since XLM-100 doesn't have this step + if lang == 'ro': + text = romanian_preprocessing(text) + text = self.moses_tokenize(text, lang=lang) + elif lang == 'th': + text = self.moses_pipeline(text, lang=lang) + try: + if 'pythainlp' not in sys.modules: + from pythainlp.tokenize import word_tokenize as th_word_tokenize + else: + th_word_tokenize = sys.modules['pythainlp'].word_tokenize + except (AttributeError, ImportError) as e: + logger.error("Make sure you install PyThaiNLP (https://github.com/PyThaiNLP/pythainlp) with the following steps") + logger.error("1. pip install pythainlp") + raise e + text = th_word_tokenize(text) + elif lang == 'zh': + try: + if 'jieba' not in sys.modules: + import jieba + else: + jieba = sys.modules['jieba'] + except (AttributeError, ImportError) as e: + logger.error("Make sure you install Jieba (https://github.com/fxsjy/jieba) with the following steps") + logger.error("1. 
pip install jieba") + raise e + text = ' '.join(jieba.cut(text)) + text = self.moses_pipeline(text, lang=lang) + text = text.split() + elif lang == 'ja': + text = self.moses_pipeline(text, lang=lang) + text = self.ja_tokenize(text) + else: + raise ValueError('It should not reach here') + + if self.do_lowercase_and_remove_accent and not bypass_tokenizer: + text = lowercase_and_remove_accent(text) + + split_tokens = [] + for token in text: + if token: + split_tokens.extend([t for t in self.bpe(token).split(' ')]) + + return split_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str/unicode) in an id using the vocab. """ + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (string/unicode) using the vocab.""" + return self.decoder.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = ''.join(tokens).replace('', ' ').strip() + return out_string + + def add_special_tokens_single_sentence(self, token_ids): + """ + Adds special tokens to a sequence for sequence classification tasks. + An XLM sequence has the following format: [CLS] X [SEP] + """ + return [self.cls_token_id] + token_ids + [self.sep_token_id] + + def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1): + """ + Adds special tokens to a sequence pair for sequence classification tasks. + An XLM sequence pair has the following format: [CLS] A [SEP] B [SEP] + """ + sep = [self.sep_token_id] + cls = [self.cls_token_id] + return cls + token_ids_0 + sep + token_ids_1 + sep + + def save_vocabulary(self, save_directory): + """Save the tokenizer vocabulary and merge files to a directory.""" + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file']) + merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file']) + + with open(vocab_file, 'w', encoding='utf-8') as f: + f.write(json.dumps(self.encoder, ensure_ascii=False)) + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!".format(merge_file)) + index = token_index + writer.write(' '.join(bpe_tokens) + u'\n') + index += 1 + + return vocab_file, merge_file diff --git a/Optimus/code/pytorch_transformers/tokenization_xlnet.py b/Optimus/code/pytorch_transformers/tokenization_xlnet.py new file mode 100755 index 0000000000000000000000000000000000000000..230095daa9894452a684d306fcbab63dd43ff830 --- /dev/null +++ b/Optimus/code/pytorch_transformers/tokenization_xlnet.py @@ -0,0 +1,214 @@ +# coding=utf-8 +# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Tokenization classes for XLNet model.""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import logging +import os +from shutil import copyfile + +import unicodedata +import six + +from .tokenization_utils import PreTrainedTokenizer + +logger = logging.getLogger(__name__) + +VOCAB_FILES_NAMES = {'vocab_file': 'spiece.model'} + +PRETRAINED_VOCAB_FILES_MAP = { + 'vocab_file': + { + 'xlnet-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model", + 'xlnet-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + 'xlnet-base-cased': None, + 'xlnet-large-cased': None, +} + +SPIECE_UNDERLINE = u'▁' + +# Segments (not really needed) +SEG_ID_A = 0 +SEG_ID_B = 1 +SEG_ID_CLS = 2 +SEG_ID_SEP = 3 +SEG_ID_PAD = 4 + +class XLNetTokenizer(PreTrainedTokenizer): + """ + SentencePiece based tokenizer. Peculiarities: + + - requires `SentencePiece `_ + """ + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + + def __init__(self, vocab_file, + do_lower_case=False, remove_space=True, keep_accents=False, + bos_token="", eos_token="", unk_token="", sep_token="", + pad_token="", cls_token="", mask_token="", + additional_special_tokens=["", ""], **kwargs): + super(XLNetTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, + unk_token=unk_token, sep_token=sep_token, + pad_token=pad_token, cls_token=cls_token, + mask_token=mask_token, additional_special_tokens= + additional_special_tokens, **kwargs) + + self.max_len_single_sentence = self.max_len - 2 # take into account special tokens + self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens + + try: + import sentencepiece as spm + except ImportError: + logger.warning("You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece" + "pip install sentencepiece") + + self.do_lower_case = do_lower_case + self.remove_space = remove_space + self.keep_accents = keep_accents + self.vocab_file = vocab_file + + self.sp_model = spm.SentencePieceProcessor() + self.sp_model.Load(vocab_file) + + @property + def vocab_size(self): + return len(self.sp_model) + + def __getstate__(self): + state = self.__dict__.copy() + state["sp_model"] = None + return state + + def __setstate__(self, d): + self.__dict__ = d + try: + import sentencepiece as spm + except ImportError: + logger.warning("You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece" + "pip install sentencepiece") + self.sp_model = spm.SentencePieceProcessor() + self.sp_model.Load(self.vocab_file) + + def preprocess_text(self, inputs): + if self.remove_space: + outputs = ' '.join(inputs.strip().split()) + else: + outputs = inputs + outputs = outputs.replace("``", '"').replace("''", '"') + + if six.PY2 and isinstance(outputs, str): + outputs = outputs.decode('utf-8') + + if not self.keep_accents: + outputs = 
unicodedata.normalize('NFKD', outputs) + outputs = ''.join([c for c in outputs if not unicodedata.combining(c)]) + if self.do_lower_case: + outputs = outputs.lower() + + return outputs + + def _tokenize(self, text, return_unicode=True, sample=False): + """ Tokenize a string. + return_unicode is used only for py2 + """ + text = self.preprocess_text(text) + # note(zhiliny): in some systems, sentencepiece only accepts str for py2 + if six.PY2 and isinstance(text, unicode): + text = text.encode('utf-8') + + if not sample: + pieces = self.sp_model.EncodeAsPieces(text) + else: + pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1) + new_pieces = [] + for piece in pieces: + if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit(): + cur_pieces = self.sp_model.EncodeAsPieces( + piece[:-1].replace(SPIECE_UNDERLINE, '')) + if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE: + if len(cur_pieces[0]) == 1: + cur_pieces = cur_pieces[1:] + else: + cur_pieces[0] = cur_pieces[0][1:] + cur_pieces.append(piece[-1]) + new_pieces.extend(cur_pieces) + else: + new_pieces.append(piece) + + # note(zhiliny): convert back to unicode for py2 + if six.PY2 and return_unicode: + ret_pieces = [] + for piece in new_pieces: + if isinstance(piece, str): + piece = piece.decode('utf-8') + ret_pieces.append(piece) + new_pieces = ret_pieces + + return new_pieces + + def _convert_token_to_id(self, token): + """ Converts a token (str/unicode) in an id using the vocab. """ + return self.sp_model.PieceToId(token) + + def _convert_id_to_token(self, index, return_unicode=True): + """Converts an index (integer) in a token (string/unicode) using the vocab.""" + token = self.sp_model.IdToPiece(index) + if six.PY2 and return_unicode and isinstance(token, str): + token = token.decode('utf-8') + return token + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (strings for sub-words) in a single string.""" + out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip() + return out_string + + def add_special_tokens_single_sentence(self, token_ids): + """ + Adds special tokens to a sequence pair for sequence classification tasks. + An XLNet sequence pair has the following format: A [SEP] B [SEP][CLS] + """ + sep = [self.sep_token_id] + cls = [self.cls_token_id] + return token_ids + sep + cls + + def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1): + """ + Adds special tokens to a sequence for sequence classification tasks. + An XLNet sequence has the following format: X [SEP][CLS] + """ + sep = [self.sep_token_id] + cls = [self.cls_token_id] + return token_ids_0 + sep + token_ids_1 + sep + cls + + def save_vocabulary(self, save_directory): + """ Save the sentencepiece vocabulary (copy original file) and special tokens file + to a directory. 
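To tie the pieces above together, here is a usage sketch (it assumes `sentencepiece` is installed and the pretrained `spiece.model` can be downloaded; the output directory name is an arbitrary example and must already exist, since `save_vocabulary` does not create it):

```python
import os
from pytorch_transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

# add_special_tokens=True appends the separator and classification tokens *after*
# the sequence, following the XLNet format described above
ids = tokenizer.encode("A single sentence.", add_special_tokens=True)

os.makedirs('./xlnet_vocab', exist_ok=True)
tokenizer.save_vocabulary('./xlnet_vocab')   # copies spiece.model into the directory
```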
+ """ + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file']) + + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): + copyfile(self.vocab_file, out_vocab_file) + + return (out_vocab_file,) diff --git a/Optimus/code/real_im_emb_plot.jpg b/Optimus/code/real_im_emb_plot.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3c4e83452289deb080c7d335cba6768d8842586e Binary files /dev/null and b/Optimus/code/real_im_emb_plot.jpg differ diff --git a/Optimus/code/scripts/scripts_docker/.run_docker.sh.swp b/Optimus/code/scripts/scripts_docker/.run_docker.sh.swp new file mode 100755 index 0000000000000000000000000000000000000000..2d64bee832ac9ec200520cab94ae47622d90442f Binary files /dev/null and b/Optimus/code/scripts/scripts_docker/.run_docker.sh.swp differ diff --git a/Optimus/code/scripts/scripts_docker/.run_docker.sh.swx b/Optimus/code/scripts/scripts_docker/.run_docker.sh.swx new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/Optimus/code/scripts/scripts_docker/4913 b/Optimus/code/scripts/scripts_docker/4913 new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/Optimus/code/scripts/scripts_docker/run_docker.sh b/Optimus/code/scripts/scripts_docker/run_docker.sh new file mode 100755 index 0000000000000000000000000000000000000000..88c04d2290ed73053efb4516dd476288eb874a48 --- /dev/null +++ b/Optimus/code/scripts/scripts_docker/run_docker.sh @@ -0,0 +1,11 @@ +SCRIPTPATH="/home/chunyl/research/project/optimus" +IMAGE=chunyl/pytorch-transformers:v2 + +docker run \ +--runtime=nvidia \ +-it --rm \ +--net host \ +--volume $SCRIPTPATH:/workspace \ +--interactive --tty $IMAGE /bin/bash + + diff --git a/Optimus/code/scripts/scripts_docker/run_docker.sh~ b/Optimus/code/scripts/scripts_docker/run_docker.sh~ new file mode 100755 index 0000000000000000000000000000000000000000..9c6d36c707a6477daf74184b7a8e5cd70c542fba --- /dev/null +++ b/Optimus/code/scripts/scripts_docker/run_docker.sh~ @@ -0,0 +1,11 @@ +SCRIPTPATH="/home/chunyuan/azure_mounts/textae_azure" +IMAGE=chunyl/pytorch-transformers:v1 + +docker run \ +--runtime=nvidia \ +-it --rm \ +--net host \ +--volume $SCRIPTPATH:/workspace \ +--interactive --tty $IMAGE /bin/bash + + diff --git a/Optimus/code/scripts/scripts_hpc/rr3_scl_ae.json b/Optimus/code/scripts/scripts_hpc/rr3_scl_ae.json new file mode 100755 index 0000000000000000000000000000000000000000..8d094728ebc338a96f8f5513bf7dc016ab20c947 --- /dev/null +++ b/Optimus/code/scripts/scripts_hpc/rr3_scl_ae.json @@ -0,0 +1,54 @@ +{ + "ClusterId": "rr1", + "VcId": "resrchprojvc7", + "JobName": "textae_wiki_beta_g8_768", + "UserName": "xiul", + "BuildId": 0, + "ToolType": null, + "ConfigFile": "/aztextae/code/examples/big_ae/run_lm_vae_pretraining_phdist_beta.py", + "Inputs": [ + { + "Name": "dataDir", + "Path": "/aztextae/data" + } + ], + "OutputRoot": { + "Name": "outputDir", + "Path": "/aztextae" + }, + "LogRoot": { + "Name": "logDir", + "Path": "/aztextae" + }, + "Outputs": [], + "IsDebug": false, + "RackId": "anyConnected", + "MinGPUs": 8, + "PrevModelPath": null, + "ExtraParams": "--use_philly --num_train_epochs 1.0 --beta 1.0 --dim_target_kl 1.0 --ratio_zero 0.5 --ratio_increase 0.25 --dataset wikipedia --per_gpu_train_batch_size 16 --per_gpu_eval_batch_size 1 
--block_size 128 --output_dir /aztextae/output/philly_rr1_g8_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_768 --encoder_model_type bert --encoder_model_name_or_path /aztextae/output/local_lm_vae_bert_gpt_init_768/initial-models-tokenization-enoder-768 --decoder_model_type gpt2 --decoder_model_name_or_path /aztextae/output/local_lm_vae_bert_gpt_init_768/initial-models-tokenization-decoder-768 --do_train --train_data_file /aztextae/data/datasets/wikipedia_json_64_filtered --overwrite_output_dir --save_steps 20000 --logging_steps 100 --use_beta_schedule --latent_size 768", + "IsMemCheck": false, + "IsCrossRack": false, + "Timeout": null, + "Registry": "index.docker.io", + "Repository": "vlnres/textae-dist", + "Tag": "v1", + "CustomMPIArgs": "env CUDA_CACHE_DISABLE=1 NCCL_IB_HCA=mlx5_0 NCCL_SOCKET_IFNAME=ib0 NCCL_DEBUG=INFO OMP_NUM_THREADS=12", + "volumes": { + "blob_out": { + "_comment": "This will mount testcontainer in the storage account pavermatest inside the container at path '/blob'. The credentials required for accessing storage account pavermatest are below, in the 'credentials' section.", + "type": "blobfuseVolume", + "storageAccount": "textae", + "containerName": "bigtextae", + "path": "/aztextae", + "options": ["-o", "allow_other"] + } + }, + "credentials": { + "storageAccounts": { + "textae": { + "_comment": "Credentials for accessing 'pavermatest' storage account. Secrets can be saved with Philly from your Philly profile page at https://philly/#/userView/. With this the secret doesn't have to be maintained in the user's workspace.", + "keyKeyvaultSecretId": "https://phillyusersecrets.vault.azure.net:443/secrets/xiul-textae/e120635ae83147ccad81a90e38fb4e89" + } + } + } +} \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_hpc/rr3_wiki_beta.json b/Optimus/code/scripts/scripts_hpc/rr3_wiki_beta.json new file mode 100755 index 0000000000000000000000000000000000000000..9a7a2f56f2e2aa9d263825902762558ef8fab2a3 --- /dev/null +++ b/Optimus/code/scripts/scripts_hpc/rr3_wiki_beta.json @@ -0,0 +1,44 @@ +{ + "version": "2019-10-23", + "metadata": { + "name": "train_wikipedia_lvlm_b16_beta", + "cluster": "rr3", + "vc": "msrhyper" + }, + "resources": { + "workers": { + "type": "skuResource", + "sku": "G16", + "count": 1, + "image": "index.docker.io/chunyl/pytorch-transformers:v1", + "commandLine": "cd /aztextae/code && python examples/big_ae/run_lm_vae_pretraining_phdist_beta.py --use_philly --num_train_epochs 1.0 --beta 1.0 --dim_target_kl 1.0 --ratio_zero 0.5 --ratio_increase 0.25 --dataset wikipedia --per_gpu_train_batch_size 16 --per_gpu_eval_batch_size 1 --block_size 128 --output_dir /aztextae/output/philly_rr3hyper_g16_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25 --encoder_model_type bert --encoder_model_name_or_path /aztextae/data/models/bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path /aztextae/data/models/gpt2 --do_train --train_data_file /aztextae/data/datasets/wikipedia_json_64_filtered --overwrite_output_dir --save_steps 20000 --logging_steps 100 --use_beta_schedule", + "constraints": [ + { + "type": "uniqueConstraint", + "tag": "connectivityDomain" + } + ], + "containerArgs": { + "shmSize": "4G" + } + } + }, + "volumes": { + "blob_out": { + "_comment": "This will mount testcontainer in the storage account pavermatest inside the container at path '/blob'. 
The credentials required for accessing storage account pavermatest are below, in the 'credentials' section.", + "type": "blobfuseVolume", + "storageAccount": "textae", + "containerName": "bigtextae", + "path": "/aztextae", + "options": ["-o", "allow_other"] + } + }, + "credentials": { + "storageAccounts": { + "textae": { + "_comment": "Credentials for accessing 'pavermatest' storage account. Secrets can be saved with Philly from your Philly profile page at https://philly/#/userView/. With this the secret doesn't have to be maintained in the user's workspace.", + "keyKeyvaultSecretId": "https://phillyusersecrets.vault.azure.net:443/secrets/chunyl-textae/7fbf670d8d6943518656d8d0900559c3" + } + } + } +} \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_hpc.sh b/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_hpc.sh new file mode 100755 index 0000000000000000000000000000000000000000..9924d88eee612a1290b19e02dcdcfea766b68c4f --- /dev/null +++ b/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_hpc.sh @@ -0,0 +1,19 @@ +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_ae_pretraining.py \ + --output_dir=../output/local_lmae_wiki2_bert_gpt_hpc \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-uncased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --mlm \ + --save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=4 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_p40.sh b/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_p40.sh new file mode 100755 index 0000000000000000000000000000000000000000..2bbde01238b27ff586a9ce7f988fe08648d30d78 --- /dev/null +++ b/Optimus/code/scripts/scripts_hpc/run_lm_ae_bert_gpt_p40.sh @@ -0,0 +1 @@ +sleep 14d \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/.run_lm_vae_bert_gpt_debug.sh.swp b/Optimus/code/scripts/scripts_local/.run_lm_vae_bert_gpt_debug.sh.swp new file mode 100755 index 0000000000000000000000000000000000000000..8713df8141d919f1b7047895c19c7af455b40b3b Binary files /dev/null and b/Optimus/code/scripts/scripts_local/.run_lm_vae_bert_gpt_debug.sh.swp differ diff --git a/Optimus/code/scripts/scripts_local/.run_lm_vae_bert_gpt_debug.sh.swx b/Optimus/code/scripts/scripts_local/.run_lm_vae_bert_gpt_debug.sh.swx new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/Optimus/code/scripts/scripts_local/eval_bert_glue_feature.sh b/Optimus/code/scripts/scripts_local/eval_bert_glue_feature.sh new file mode 100755 index 0000000000000000000000000000000000000000..33e805aef0487e9e60f92a4453e3eedcea3aa307 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_bert_glue_feature.sh @@ -0,0 +1,50 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + + +export GLUE_DIR=/workspace/data/datasets/glue_data/glue_data +export TASK_NAME=YELP # SST-2 # CoLA # SST-2 # MRPC + +# python ./examples/run_glue.py \ +# --model_type bert \ +# --model_name_or_path bert-base-cased \ +# --task_name $TASK_NAME \ +# --do_eval \ +# --do_lower_case \ +# --save_steps 200 \ +# --data_dir $GLUE_DIR/$TASK_NAME \ +# --max_seq_length 
128 \ +# --per_gpu_eval_batch_size=32 \ +# --per_gpu_train_batch_size=32 \ +# --learning_rate 2e-5 \ +# --num_train_epochs 50.0 \ +# --percentage_per_label .5 \ +# --sample_per_label 10000 \ +# --output_dir ../output/local_features_$TASK_NAME/ \ +# --use_freeze \ +# --overwrite_output_dir \ +# --eval_all_checkpoints \ +# --collect_feature + +python ./examples/run_glue_vae.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_eval \ + --do_lower_case \ + --save_steps 200 \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --per_gpu_eval_batch_size=32 \ + --per_gpu_train_batch_size=32 \ + --learning_rate 2e-5 \ + --num_train_epochs 50.0 \ + --percentage_per_label .5 \ + --sample_per_label 10000 \ + --output_dir ../output/local_features_$TASK_NAME/ \ + --use_freeze \ + --overwrite_output_dir \ + --collect_feature \ + --checkpoint_dir ../output/philly_rr1_vae_yelp_short_epoch1.0_b1.0_d1.0_r00.5_ra0.25 \ + --gloabl_step_eval 6939 + \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_gpt2_generation.sh b/Optimus/code/scripts/scripts_local/eval_gpt2_generation.sh new file mode 100755 index 0000000000000000000000000000000000000000..42385359b0f2ed56e422d59844d3b81ed6b456ef --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_gpt2_generation.sh @@ -0,0 +1,77 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export PYTHONPATH="${PYTHONPATH}:/workspace/code/examples/big_ae" +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/philly_clm_wiki2_0.0 \ +# --output_dir=../output/philly_clm_wiki2_0.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 400 + +# philly_vae_news_ft_20epoch_40ae_klon_b1.0_d1_r00.0_ra0.5 + +export TRAIN_FILE=../data/datasets/news_data/train.txt +export TEST_FILE=../data/datasets/news_data/valid.txt +export GPU_ID=1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_gpt2_generation.py \ + --dataset News \ + --checkpoint_dir=../output/philly_clm_news_gpt2_epoch10 \ + --output_dir=../output/local_lm_vae_news_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 11100 \ + --block_size 256 \ + --max_seq_length 128 \ + --num_sents 100 \ + --temperature 0.5 \ + --top_p 0.0 + + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=1 + +# 
CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 800 \ +# --total_sents 10 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_lm_ae_bert_gpt.sh b/Optimus/code/scripts/scripts_local/eval_lm_ae_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..35712685bdf0635017be35fc1afe1036029a3b1e --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_lm_ae_bert_gpt.sh @@ -0,0 +1,19 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1,2,3,4,5,6,7 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_ae_pretraining.py \ + --output_dir=../output/local_lmae_wiki2_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-uncased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=1 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_lm_causal_bert_gpt.sh b/Optimus/code/scripts/scripts_local/eval_lm_causal_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..3efbc3d55ebc6e3d12223364ff9c54328b8489d0 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_lm_causal_bert_gpt.sh @@ -0,0 +1,19 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1,2,3,4,5,6,7 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_causal_pretraining.py \ + --output_dir=../output/local_lm_causal_wiki2_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-uncased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=1 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_lm_vae_bert_gpt.sh b/Optimus/code/scripts/scripts_local/eval_lm_vae_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..40d2e450dfe14c36b7fa565343eb41db104d8d0d --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_lm_vae_bert_gpt.sh @@ -0,0 +1,74 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + + +export TRAIN_FILE=../data/datasets/yahoo_data/train.txt +export TEST_FILE=../data/datasets/yahoo_data/test.txt + +# export GPU_ID=0 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --output_dir=../output/philly_vae_yahoo_b0.25_d0.01_r00.5_ra0.25 \ + --dataset Yahoo \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + 
--save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --evaluate_during_training \ + --per_gpu_train_batch_size=1 \ + --gloabl_step_eval 6250 + + +# export TRAIN_FILE=../data/datasets/snli_data/train.txt +# export TEST_FILE=../data/datasets/snli_data/test.txt + +# export GPU_ID=0 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_snli_bert_gpt \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_eval \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 200 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --evaluate_during_training \ +# --per_gpu_train_batch_size=1 \ +# --gloabl_step_eval 12000 + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt + +# export GPU_ID=0 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_eval \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 200 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --evaluate_during_training \ +# --per_gpu_train_batch_size=1 \ +# --gloabl_step_eval 200 diff --git a/Optimus/code/scripts/scripts_local/eval_optimus_latent_space.sh b/Optimus/code/scripts/scripts_local/eval_optimus_latent_space.sh new file mode 100755 index 0000000000000000000000000000000000000000..5a8455a35a40943f65a6f763d94641b656b773ba --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_optimus_latent_space.sh @@ -0,0 +1,185 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +# export TRAIN_FILE=../data/datasets/penn/train.txt +# export TEST_FILE=../data/datasets/penn/test.txt + + +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/philly_clm_wiki2_0.0 \ +# --output_dir=../output/philly_clm_wiki2_0.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 400 + + +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=1 + + +# # interpolation from pre-trained model on wiki +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ +# 
--dataset Debug \ +# --checkpoint_dir=../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_768_v2 \ +# --output_dir=../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_768_v2 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 508523 \ +# --block_size 100 \ +# --max_seq_length 100 \ +# --latent_size 768 \ +# --play_mode interpolation \ +# --num_interpolation_steps 10 + + +# # reconstruction from pre-trained model on wiki +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2 \ +# --output_dir=../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 400000 \ +# --block_size 100 \ +# --max_seq_length 100 \ +# --latent_size 32 \ +# --play_mode reconstrction + + + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 31250 \ +# --block_size 100 \ +# --max_seq_length 100 \ +# --latent_size 768 \ +# --play_mode interpolation \ +# --num_interpolation_steps 10 + +# reconstrction +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 31250 \ +# --block_size 100 \ +# --max_seq_length 100 \ +# --latent_size 768 \ +# --play_mode reconstrction + + +# interact_with_user_input +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ + --dataset Debug \ + --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE 
\ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 31250 \ + --block_size 100 \ + --max_seq_length 100 \ + --latent_size 768 \ + --interact_with_user_input \ + --play_mode analogy \ + --sent_source="a yellow cat likes to chase a long string ." \ + --sent_target="a yellow cat likes to chase a short string ." \ + --sent_input="a brown dog likes to eat long pasta ." \ + --degree_to_target=1.0 + + + +# interact_with_user_input +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 31250 \ +# --block_size 100 \ +# --max_seq_length 100 \ +# --latent_size 768 \ +# --interact_with_user_input \ +# --play_mode interpolation \ +# --sent_source="a yellow cat likes to chase a short string ." \ +# --sent_target="a brown dog likes to eat his food very slowly ." \ +# --num_interpolation_steps=10 + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 800 \ +# --total_sents 10 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_vae_generation.sh b/Optimus/code/scripts/scripts_local/eval_vae_generation.sh new file mode 100755 index 0000000000000000000000000000000000000000..4006afed1f8b083028c1e30edb9ca141b17ea14e --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_vae_generation.sh @@ -0,0 +1,76 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/philly_clm_wiki2_0.0 \ +# --output_dir=../output/philly_clm_wiki2_0.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 400 
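+# Sketch of what --play_mode interpolation does (based on the paper's description, not a line-by-line
+# reading of the script; the actual code path is examples/big_ae/run_encoding_generation.py): the two
+# input sentences are encoded to latent codes z_1 and z_2, intermediate codes
+# z_t = (1 - t) * z_1 + t * z_2 are formed for t = 0, 1/N, ..., 1 with N = --num_interpolation_steps,
+# and each z_t is decoded back to a sentence with the GPT-2 decoder.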
+ + +export TRAIN_FILE=../data/datasets/snli_data/train.txt +export TEST_FILE=../data/datasets/snli_data/test.txt +export GPU_ID=1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ + --dataset Snli \ + --checkpoint_dir=../output/philly_vae_snli_epoch20_b1.0_d0.5_r00.5_ra0.25 \ + --output_dir=../output/local_lm_vae_snli_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 50000 \ + --block_size 100 \ + --max_seq_length 100 \ + --play_mode interpolation \ + --num_interpolation_steps 20 + + + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 800 \ +# --total_sents 10 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/eval_vae_generation_prior.sh b/Optimus/code/scripts/scripts_local/eval_vae_generation_prior.sh new file mode 100755 index 0000000000000000000000000000000000000000..bd47dce9d83efc0631bc1b3158372504d888511a --- /dev/null +++ b/Optimus/code/scripts/scripts_local/eval_vae_generation_prior.sh @@ -0,0 +1,77 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export PYTHONPATH="${PYTHONPATH}:/workspace/code/examples/big_ae" +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/philly_clm_wiki2_0.0 \ +# --output_dir=../output/philly_clm_wiki2_0.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 400 + +# philly_vae_news_ft_20epoch_40ae_klon_b1.0_d1_r00.0_ra0.5 + +export TRAIN_FILE=../data/datasets/news_data/train.txt +export TEST_FILE=../data/datasets/news_data/valid.txt +export GPU_ID=1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_generation_from_prior.py \ + --dataset News \ + --checkpoint_dir=../output/philly_vae_news_ft_20epoch_40ae_klon_b1.0_d1_r00.0_ra0.5 \ + --output_dir=../output/local_lm_vae_news_bert_gpt \ + --encoder_model_type=bert \ + 
--encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 167860 \ + --block_size 256 \ + --max_seq_length 128 \ + --num_sents 10000 \ + --temperature 0.5 \ + --top_p 0.0 + + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_encoding_generation.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/local_lm_vae_debug_bert_gpt \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --per_gpu_eval_batch_size=1 \ +# --gloabl_step_eval 800 \ +# --total_sents 10 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_bert_glue.sh b/Optimus/code/scripts/scripts_local/run_bert_glue.sh new file mode 100755 index 0000000000000000000000000000000000000000..09ba83dffda6f5e01ef313618b06aee5b5425509 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_bert_glue.sh @@ -0,0 +1,28 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + + +export GLUE_DIR=/workspace/data/datasets/glue_data/glue_data +export TASK_NAME=YELP # SST-2 # CoLA # SST-2 # MRPC + +python ./examples/run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --do_lower_case \ + --save_steps 200 \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --per_gpu_eval_batch_size=32 \ + --per_gpu_train_batch_size=32 \ + --learning_rate 2e-5 \ + --num_train_epochs 50.0 \ + --percentage_per_label .5 \ + --sample_per_label 10000 \ + --output_dir /tmp/$TASK_NAME/ \ + --use_freeze \ + --overwrite_output_dir + + \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_causal_lm_gpt.sh b/Optimus/code/scripts/scripts_local/run_causal_lm_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..d885288150bfd1c7010b3b8232ed908604392021 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_causal_lm_gpt.sh @@ -0,0 +1,15 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/run_lm_finetuning.py \ + --output_dir=output \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=2 diff --git a/Optimus/code/scripts/scripts_local/run_data_filtering_wiki.sh b/Optimus/code/scripts/scripts_local/run_data_filtering_wiki.sh new file mode 100755 index 0000000000000000000000000000000000000000..6226774ae4dae0f2ea59fe95d14e7bf5f068a81e --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_data_filtering_wiki.sh @@ -0,0 +1,28 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export INPUT_FILE_PATH=../data/datasets/wikipedia_json_64/ +export OUTPUT_FILE_PATH=../data/datasets/wikipedia_json_64_filtered/ +export OUTPUT_DIR=./output/data_preprocessing/log_wikipedia_overlength_filtering/ +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python 
examples/big_ae/run_data_filtering.py \ + --dataset wikipedia \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --input_file_path=$INPUT_FILE_PATH \ + --output_file_path=$OUTPUT_FILE_PATH \ + --output_dir=$OUTPUT_DIR \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --num_train_epochs 1.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 2 \ + --gloabl_step_eval 4 \ + --block_size 50 diff --git a/Optimus/code/scripts/scripts_local/run_dialog_dataloader.sh b/Optimus/code/scripts/scripts_local/run_dialog_dataloader.sh new file mode 100755 index 0000000000000000000000000000000000000000..63fe74840dbc4c4a7ded2ab7c143d4cfeda941dc --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_dialog_dataloader.sh @@ -0,0 +1,27 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/dialog_toy/train.txt +export TEST_FILE=../data/datasets/dialog_toy/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_dialog_dataloader.py \ + --dataset dialog_toy \ + --output_dir=../output/local_dialog_dataloader \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 2 \ + --gloabl_step_eval 4 \ + --block_size 128 diff --git a/Optimus/code/scripts/scripts_local/run_dialog_spacefusion.sh b/Optimus/code/scripts/scripts_local/run_dialog_spacefusion.sh new file mode 100755 index 0000000000000000000000000000000000000000..5546537b481212ed88d66fcbad9818e3e9e7d5a0 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_dialog_spacefusion.sh @@ -0,0 +1,136 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/dailydialog_data/train.txt +export TEST_FILE=../data/datasets/dailydialog_data/test.txt +export GENERATED_TEXT_FILE=../output/local_dialog_dataloader/eval_text_generation_results.txt + +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ + --dataset dailydialog \ + --output_dir=../output/local_dialog_dataloader \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_generation \ + --do_train \ + --do_eval \ + --beta 2.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 2000 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 4 \ + --block_size 512 \ + --freeze_bert \ + --per_gpu_eval_batch_size 1 \ + --total_sents -1 \ + --sents_per_cxt 10 \ + --num_frozen_bert_layer 10 \ + --num_s2s_bert_layer 2 \ + --eval_generated_text_file $GENERATED_TEXT_FILE\ + --checkpoint_dir ../output/philly_rr3scl_g8_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25 \ + --gloabl_step_eval 760000 \ + --use_pretrained_model \ + --use_pretrained_vae + + +# export GENERATED_TEXT_PATH=philly-dailydialog-epoch-2.0-beta-30.0 +# export 
GENERATED_TEXT_FILE=../output/dialog/$GENERATED_TEXT_PATH/eval_text_generation_results.txt +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ +# --dataset dailydialog \ +# --output_dir=../output/dialog/$GENERATED_TEXT_PATH \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --beta 2.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 2000 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 10 \ +# --block_size 512 \ +# --freeze_bert \ +# --per_gpu_eval_batch_size 1 \ +# --total_sents -1 \ +# --sents_per_cxt 10 \ +# --eval_generated_text_file $GENERATED_TEXT_FILE\ +# --checkpoint_dir ../output/dialog/$GENERATED_TEXT_PATH \ +# --gloabl_step_eval 10000 \ +# --use_pretrained_model \ +# --do_vis \ +# --path_ids=../data/datasets/dailydialog_data/dailydialog_data_1000.pt \ +# --n_pnt=64 + + +# --do_eval \ + +# export GENERATED_TEXT_FILE=../output/dialog/philly-dailydialog-epoch-5.0-beta-1.0/eval_text_generation_results.txt +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ +# --dataset dailydialog \ +# --output_dir=../output/dialog/philly-dailydialog-epoch-5.0-beta-1.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_eval \ +# --beta 2.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 2000 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 4 \ +# --block_size 512 \ +# --freeze_bert11 \ +# --per_gpu_eval_batch_size 1 \ +# --total_sents -1 \ +# --sents_per_cxt 10 \ +# --eval_generated_text_file $GENERATED_TEXT_FILE\ +# --checkpoint_dir ../output/dialog/philly-dailydialog-epoch-5.0-beta-1.0 \ +# --gloabl_step_eval 26000 \ +# --use_pretrained_model + +# export GENERATED_TEXT_FILE=../output/dialog/philly-dailydialog-full-epoch-5.0-beta-1.0/eval_text_generation_results.txt +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ +# --dataset dailydialog \ +# --output_dir=../output/dialog/philly-dailydialog-full-epoch-5.0-beta-1.0 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_eval \ +# --beta 2.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 2000 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 4 \ +# --block_size 512 \ +# --freeze_bert11 \ +# --per_gpu_eval_batch_size 1 \ +# --total_sents -1 \ +# --sents_per_cxt 10 \ +# --eval_generated_text_file $GENERATED_TEXT_FILE \ +# --checkpoint_dir ../output/dialog/philly-dailydialog-full-epoch-5.0-beta-1.0 \ +# --gloabl_step_eval 26000 \ +# --use_pretrained_model + + # diff --git a/Optimus/code/scripts/scripts_local/run_dialog_spacefusion_switchboard.sh b/Optimus/code/scripts/scripts_local/run_dialog_spacefusion_switchboard.sh new file mode 100755 index 0000000000000000000000000000000000000000..0d3022c4261ce4725826bcd6f1891ab150944c43 --- /dev/null +++ 
b/Optimus/code/scripts/scripts_local/run_dialog_spacefusion_switchboard.sh @@ -0,0 +1,72 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/switchboard/train.txt +export TEST_FILE=../data/datasets/switchboard/test.txt.1ref +export GENERATED_TEXT_FILE=../output/dialog/local-dialog-switchboard/eval_text_generation_results.txt + +export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ +# --dataset dailydialog \ +# --output_dir=../output/dialog/local-dialog-switchboard \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_generation \ +# --do_train \ +# --do_eval \ +# --beta 2.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 5.0 \ +# --save_steps 2000 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 4 \ +# --block_size 512 \ +# --freeze_bert11 \ +# --per_gpu_eval_batch_size 1 \ +# --total_sents -1 \ +# --sents_per_cxt 10 \ +# --eval_generated_text_file $GENERATED_TEXT_FILE\ +# --checkpoint_dir ../output/philly_rr3scl_g8_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25 \ +# --gloabl_step_eval 760000 \ +# --use_pretrained_model \ +# --use_pretrained_vae + + + +export GENERATED_TEXT_PATH=philly-switchboard-epoch-5.0-beta-1.0 +export GENERATED_TEXT_FILE=../output/dialog/$GENERATED_TEXT_PATH/eval_text_generation_results.txt + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_spacefusion_pretraining.py \ + --dataset switchboard \ + --output_dir=../output/dialog/$GENERATED_TEXT_PATH \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --beta 2.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 2000 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 4 \ + --block_size 512 \ + --freeze_bert11 \ + --per_gpu_eval_batch_size 1 \ + --total_sents -1 \ + --sents_per_cxt 10 \ + --eval_generated_text_file $GENERATED_TEXT_FILE\ + --checkpoint_dir ../output/dialog/$GENERATED_TEXT_PATH \ + --gloabl_step_eval 94000 \ + --use_pretrained_model + diff --git a/Optimus/code/scripts/scripts_local/run_ft_lm_vae_optimus.sh b/Optimus/code/scripts/scripts_local/run_ft_lm_vae_optimus.sh new file mode 100755 index 0000000000000000000000000000000000000000..4417e80bdc31cb2d6179d0927d093f1cd21c2a83 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_ft_lm_vae_optimus.sh @@ -0,0 +1,272 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/LM/local_lm_vae_debug_optimus \ +# --dataset Debug \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --dim_target_kl 0.5\ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 100.0 \ +# --save_steps 1000 \ 
+# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 \ +# --use_pretrained_model \ +# --use_pretrained_vae \ +# --checkpoint_dir ../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_32_v2 \ +# --gloabl_step_eval 320000 + + + +# export TRAIN_FILE=../data/datasets/yelp_data/train.txt +# export TEST_FILE=../data/datasets/yelp_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_yelp_bert_gpt \ +# --dataset Yelp \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 \ +# --block_size 300 + +# export TRAIN_FILE=../data/datasets/yahoo_data/train.txt +# export TEST_FILE=../data/datasets/yahoo_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_yahoo_bert_gpt \ +# --dataset Yahoo \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 \ +# --block_size 300 + +# export TRAIN_FILE=../data/datasets/snli_data/train.txt +# export TEST_FILE=../data/datasets/snli_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_snli_bert_gpt \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 + +export TRAIN_FILE=../data/datasets/penn/train.txt +export TEST_FILE=../data/datasets/penn/test.txt + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --output_dir=../output/local_lm_vae_penn_bert_gpt \ + --dataset Penn \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --beta 1.0 \ + --ratio_zero 0.5 \ + --ratio_increase 0.25 \ + --do_train \ + --do_eval \ + --fb_mode 1 \ + --dim_target_kl 0.5\ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 1000 \ + --logging_steps 1000 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=10 \ + --block_size 100 \ + --use_pretrained_model \ + --use_pretrained_vae \ + --checkpoint_dir 
../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_32_v2 \ + --gloabl_step_eval 320000 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_penn_bert_gpt \ +# --dataset Penn \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --dim_target_kl 0.5\ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 + +# export TRAIN_FILE=../data/datasets/wikipedia/wikipedia.segmented.nltk.txt +# export TEST_FILE=../data/datasets/wikipedia/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_wikipedia_bert_gpt \ +# --dataset wikipedia \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=20 \ +# --block_size 100 + + + +# export TRAIN_FILE=../data/datasets/news_data/train.txt +# export TEST_FILE=../data/datasets/news_data/test.txt + + +# export TRAIN_FILE=../data/datasets/glue_data/glue_data/YELP/train.txt +# export TEST_FILE=../data/datasets/glue_data/glue_data/YELP/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --dataset News \ +# --checkpoint_dir ../output/philly_scl_b16_g8_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1_200 \ +# --gloabl_step_eval 880000 \ +# --output_dir=../output/local_lm_vae_news_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --beta 1.0 \ +# --dim_target_kl 1.0 \ +# --ratio_zero .0 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 5000 \ +# --logging_steps 200 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 16 \ +# --block_size 256 \ +# --use_pretrained_model + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --dataset News \ +# --output_dir=../output/local_lm_vae_bert_gpt_init_768 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --beta 1.0 \ +# --dim_target_kl 1.0 \ +# --ratio_zero .0 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 5000 \ +# --logging_steps 200 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 16 \ +# --block_size 256 \ +# --latent_size 768 \ +# --save_bert_gpt_init + + +# export TRAIN_FILE=../data/datasets/glue_data/train.txt +# export TEST_FILE=../data/datasets/glue_data/train.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python 
examples/big_ae/run_lm_vae_training.py \ +# --dataset Glue \ +# --checkpoint_dir ../output/philly_scl_b16_g8_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1_200 \ +# --gloabl_step_eval 880000 \ +# --output_dir=../output/local_lm_vae_glue_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --beta 1.0 \ +# --dim_target_kl 1.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 5000 \ +# --logging_steps 200 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 16 \ +# --block_size 256 \ +# --use_pretrained_model diff --git a/Optimus/code/scripts/scripts_local/run_generation_gpt2.sh b/Optimus/code/scripts/scripts_local/run_generation_gpt2.sh new file mode 100755 index 0000000000000000000000000000000000000000..6f012b9deb5f18a4a92c9e2ced11ddae8d38a5cd --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_generation_gpt2.sh @@ -0,0 +1,4 @@ +python ./examples/run_generation.py \ + --model_type=gpt2 \ + --length=20 \ + --model_name_or_path=gpt2 \ diff --git a/Optimus/code/scripts/scripts_local/run_glue_data_integration.sh b/Optimus/code/scripts/scripts_local/run_glue_data_integration.sh new file mode 100755 index 0000000000000000000000000000000000000000..9029908c97a0b21c932b228155007ee2066bf70f --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_glue_data_integration.sh @@ -0,0 +1,16 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + + +export GLUE_DIR=/workspace/data/datasets/glue_data/glue_data + +python ./examples/run_glue_data_integration.py \ + --output_dir ../output/local_glue_data \ + --data_dir $GLUE_DIR\ + --model_type bert \ + --model_name_or_path bert-base-cased \ + --percentage_per_label .5 \ + --use_freeze \ + --overwrite_output_dir + + \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_gpt_generation.sh b/Optimus/code/scripts/scripts_local/run_gpt_generation.sh new file mode 100755 index 0000000000000000000000000000000000000000..67dbf3f350a34da9df8f070fc71d93c7f6200acb --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_gpt_generation.sh @@ -0,0 +1,6 @@ + +export GPU_ID=0 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/run_generation.py \ + --model_type=gpt2 \ + --model_name_or_path=gpt2 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_lm_ae_bert_gpt.sh b/Optimus/code/scripts/scripts_local/run_lm_ae_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..2bbbd8f71b78c7694295a82b5b37dd99f6df38b7 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_ae_bert_gpt.sh @@ -0,0 +1,20 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1,2,3,4,5,6,7 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_ae_pretraining.py \ + --output_dir=../output/local_lmae_wiki2_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-uncased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=3 \ No newline at end of file diff --git 
a/Optimus/code/scripts/scripts_local/run_lm_causal_bert_gpt.sh b/Optimus/code/scripts/scripts_local/run_lm_causal_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..9fa9f0b8d1b1acae3cfcaf2623169ab407ce8b8a --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_causal_bert_gpt.sh @@ -0,0 +1,20 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1,2,3,4,5,6,7 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_causal_pretraining.py \ + --output_dir=../output/local_lm_causal_wiki2_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-uncased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 200 \ + --logging_steps 100 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=3 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_lm_ctrl.sh b/Optimus/code/scripts/scripts_local/run_lm_ctrl.sh new file mode 100755 index 0000000000000000000000000000000000000000..4305fdb9e44dd0834355ff1f2678173b570afe65 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_ctrl.sh @@ -0,0 +1,151 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt + + + +export TRAIN_FILE=../data/datasets/yelp_style/sentiment.train.text +export TEST_FILE=../data/datasets/yelp_style/sentiment.test.text.1000sents + + + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_label_ctrl_gen.py \ +# --output_dir ../output/local_lm_vae_label_ctrl_gen \ +# --checkpoint_dir ../output/philly_cara_yelp_50.0 \ +# --gloabl_step_eval 43650 \ +# --dataset Yelp \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --num_train_epochs 1.0 \ +# --overwrite_output_dir 1 \ +# --per_gpu_train_batch_size=32 \ +# --block_size 300 \ +# --do_eval + + + + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_label_ctrl_gen.py \ + --output_dir ../output/local_lm_vae_label_ctrl_gen \ + --checkpoint_dir ../output/local_lm_vae_label_ctrl_gen \ + --gloabl_step_eval 6989 \ + --use_pretrained_model \ + --dataset Yelp \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --save_steps 1000 \ + --logging_steps 1000 \ + --num_train_epochs 1.0 \ + --overwrite_output_dir 1 \ + --per_gpu_train_batch_size=32 \ + --block_size 300 \ + --do_eval + + + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_label_ctrl_gen.py \ +# --output_dir ../output/local_lm_vae_label_ctrl_gen \ +# --checkpoint_dir ../output/philly_rr3scl_g8_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25 \ +# --gloabl_step_eval 760000 \ +# --use_pretrained_model \ +# --use_pretrained_vae \ +# --dataset Yelp \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# 
--decoder_model_name_or_path=gpt2 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --num_train_epochs 1.0 \ +# --overwrite_output_dir 1 \ +# --per_gpu_train_batch_size=32 \ +# --block_size 300 \ +# --do_eval \ +# --do_train + + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_label_ctrl_gen.py \ +# --output_dir ../output/local_lm_vae_label_ctrl_gen \ +# --checkpoint_dir ../output/philly_scl_b16_g8_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1_200 \ +# --dataset Yelp \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --gloabl_step_eval 880000 \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --num_train_epochs 1.0 \ +# --overwrite_output_dir 1 \ +# --per_gpu_train_batch_size=32 \ +# --use_pretrained_model \ +# --block_size 300 \ +# --do_train \ +# --do_eval + + + +# export TRAIN_FILE=../data/datasets/snli_data/train.txt +# export TEST_FILE=../data/datasets/snli_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_snli_bert_gpt \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 + + +# export TRAIN_FILE=../data/datasets/wikipedia/wikipedia.segmented.nltk.txt +# export TEST_FILE=../data/datasets/wikipedia/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_wikipedia_bert_gpt \ +# --dataset wikipedia \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=20 \ +# --block_size 100 diff --git a/Optimus/code/scripts/scripts_local/run_lm_gpt_debug.sh b/Optimus/code/scripts/scripts_local/run_lm_gpt_debug.sh new file mode 100755 index 0000000000000000000000000000000000000000..1d4ac13eaa0adba98d1441a01bd543308df3a810 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_gpt_debug.sh @@ -0,0 +1,49 @@ + +# export TRAIN_FILE=../data/datasets/penn/train.txt +# export TEST_FILE=../data/datasets/penn/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_finetuning_baseline.py \ +# --output_dir=../output/local_lm_gpt_penn \ +# --dataset Yahoo \ +# --model_type=gpt2 \ +# --model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 2.0 \ +# --save_steps 1000 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 \ +# --gloabl_step_eval 600 + + +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export 
TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_gpt2_training.py \ + --dataset Debug \ + --output_dir=../output/local_gpt2_debug \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 20.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 1 \ + --gloabl_step_eval 4 \ + --block_size 50 \ diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt.sh b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt.sh new file mode 100755 index 0000000000000000000000000000000000000000..26b15371c433fed4f490f4d47b4e01a206c800db --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt.sh @@ -0,0 +1,234 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt + + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_wiki2_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --do_train \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 200 \ +# --logging_steps 100 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=3 + + +# export TRAIN_FILE=../data/datasets/yelp_data/train.txt +# export TEST_FILE=../data/datasets/yelp_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_yelp_bert_gpt \ +# --dataset Yelp \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 \ +# --block_size 300 + +# export TRAIN_FILE=../data/datasets/yahoo_data/train.txt +# export TEST_FILE=../data/datasets/yahoo_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_yahoo_bert_gpt \ +# --dataset Yahoo \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 \ +# --block_size 300 + +# export TRAIN_FILE=../data/datasets/snli_data/train.txt +# export TEST_FILE=../data/datasets/snli_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# 
--output_dir=../output/local_lm_vae_snli_bert_gpt \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 + +# export TRAIN_FILE=../data/datasets/penn/train.txt +# export TEST_FILE=../data/datasets/penn/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --output_dir=../output/local_lm_vae_penn_bert_gpt \ +# --dataset Penn \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --dim_target_kl 0.5\ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=10 \ +# --block_size 100 + + + +# export TRAIN_FILE=../data/datasets/wikipedia/wikipedia.segmented.nltk.txt +# export TEST_FILE=../data/datasets/wikipedia/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_wikipedia_bert_gpt \ +# --dataset wikipedia \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=20 \ +# --block_size 100 + + + +# export TRAIN_FILE=../data/datasets/news_data/train.txt +# export TEST_FILE=../data/datasets/news_data/test.txt + + +# export TRAIN_FILE=../data/datasets/glue_data/glue_data/YELP/train.txt +# export TEST_FILE=../data/datasets/glue_data/glue_data/YELP/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --dataset News \ +# --checkpoint_dir ../output/philly_scl_b16_g8_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1_200 \ +# --gloabl_step_eval 880000 \ +# --output_dir=../output/local_lm_vae_news_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --beta 1.0 \ +# --dim_target_kl 1.0 \ +# --ratio_zero .0 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 5000 \ +# --logging_steps 200 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 16 \ +# --block_size 256 \ +# --use_pretrained_model + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --dataset News \ +# --output_dir=../output/local_lm_vae_bert_gpt_init_768 \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# 
--train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --beta 1.0 \ +# --dim_target_kl 1.0 \ +# --ratio_zero .0 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 5000 \ +# --logging_steps 200 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 16 \ +# --block_size 256 \ +# --latent_size 768 \ +# --save_bert_gpt_init + + +export TRAIN_FILE=../data/datasets/glue_data/train.txt +export TEST_FILE=../data/datasets/glue_data/train.txt + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Glue \ + --checkpoint_dir ../output/philly_scl_b16_g8_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1_200 \ + --gloabl_step_eval 880000 \ + --output_dir=../output/local_lm_vae_glue_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --beta 1.0 \ + --dim_target_kl 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 5000 \ + --logging_steps 200 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 16 \ + --block_size 256 \ + --use_pretrained_model diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh new file mode 100755 index 0000000000000000000000000000000000000000..4851e851280ee1f47db9a8fd2c894914890c48db --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh @@ -0,0 +1,26 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Debug \ + --output_dir=../output/local_lm_vae_debug_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 5.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 3 \ + --block_size 50 diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh~ b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh~ new file mode 100755 index 0000000000000000000000000000000000000000..a59a29a0d96b0a0121a8f4e2de69de5be7da5965 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_debug.sh~ @@ -0,0 +1,27 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Debug \ + --output_dir=../output/local_lm_vae_debug_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 20.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=4 
\ + --gloabl_step_eval 4 \ + --block_size 50 diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_distributed.sh b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_distributed.sh new file mode 100755 index 0000000000000000000000000000000000000000..635e5cf5b01322f9964a13c35c7d439de2819411 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_bert_gpt_distributed.sh @@ -0,0 +1,85 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +# export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +# export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1 +CUDA_VISIBLE_DEVICES=$GPU_ID python -m torch.distributed.launch --nproc_per_node 2 examples/big_ae/run_lm_vae_pretraining_phdist.py \ +--num_train_epochs 1.0 --beta 0.0 --dim_target_kl 1.0 --ratio_zero 0.5 --ratio_increase 0.25 --latent_size 32 --dataset wikipedia \ +--per_gpu_train_batch_size 24 --per_gpu_eval_batch_size 1 --block_size 128 \ +--output_dir ../output/pretrain/debug/g2_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 \ +--encoder_model_type bert --encoder_model_name_or_path ../data/models/local_bert_gpt_init/initial-models-tokenization-enoder-32 \ +--decoder_model_type gpt2 --decoder_model_name_or_path ../data/models/local_bert_gpt_init/initial-models-tokenization-decoder-32 \ +--do_train --train_data_file ../data/datasets/wikipedia_json_64_filtered --overwrite_output_dir --save_steps 20000 --logging_steps 100 --use_beta_schedule + + +# export TRAIN_FILE=../data/datasets/yelp_data/train.txt +# export TEST_FILE=../data/datasets/yelp_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_yelp_bert_gpt \ +# --dataset Yelp \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --do_train \ +# --do_eval \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 + + +# export TRAIN_FILE=../data/datasets/yahoo_data/train.txt +# export TEST_FILE=../data/datasets/yahoo_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_yahoo_bert_gpt \ +# --dataset Yahoo \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 0.25 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.1 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 + + +# export TRAIN_FILE=../data/datasets/snli_data/train.txt +# export TEST_FILE=../data/datasets/snli_data/test.txt + +# CUDA_VISIBLE_DEVICES=$GPU_ID python -m torch.distributed.launch --nproc_per_node 2 \ +# examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_snli_bert_gpt_distributed \ +# --dataset Snli \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-uncased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 1.0 \ +# --ratio_zero 0.5 \ +# --ratio_increase 0.25 \ +# --do_train \ +# --do_eval \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# 
--eval_data_file=$TEST_FILE \ +# --num_train_epochs 1.0 \ +# --save_steps 1000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=30 \ +# --block_size 100 diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_debug.sh b/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_debug.sh new file mode 100755 index 0000000000000000000000000000000000000000..731858a1d1078a72f840ca45f4eeefaee39a3fbe --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_debug.sh @@ -0,0 +1,64 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/penn/train.txt +export TEST_FILE=../data/datasets/penn/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Debug \ + --checkpoint_dir=../output/inject_latent/philly_vae_penn_b0.0_emb_1_mem_0 \ + --output_dir=../output/inject_latent/philly_vae_penn_b0.0_emb_1_mem_0 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_eval_rec \ + --beta 0.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 2.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 10 \ + --per_gpu_eval_batch_size 10 \ + --gloabl_step_eval 4 \ + --block_size 100 \ + --latent_size 32 \ + --latent_as_gpt_emb 1 \ + --latent_as_gpt_memory 0 \ + --gloabl_step_eval 13145 + + + +# export TRAIN_FILE=../data/datasets/debug_data/train.txt +# export TEST_FILE=../data/datasets/debug_data/test.txt +# export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ +# --dataset Debug \ +# --checkpoint_dir=../output/philly_rr1_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1 \ +# --output_dir=../output/local_lm_vae_debug_bert_gpt \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --train_data_file=$TRAIN_FILE \ +# --do_train \ +# --do_eval \ +# --beta 1.0 \ +# --ratio_zero .5 \ +# --ratio_increase 0.25 \ +# --eval_data_file=$TEST_FILE \ +# --num_train_epochs 2.0 \ +# --save_steps 20 \ +# --logging_steps 4 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size 1 \ +# --gloabl_step_eval 4 \ +# --block_size 50 \ +# --latent_size 32 \ +# --latent_as_gpt_memory 0 + diff --git a/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_init_debug.sh b/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_init_debug.sh new file mode 100755 index 0000000000000000000000000000000000000000..15461e07232dd770ef6e2cdd3d9a2365030036e5 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_lm_vae_lvm_init_debug.sh @@ -0,0 +1,29 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Debug \ + --checkpoint_dir=../output/philly_rr1_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1 \ + --output_dir=../output/local_lm_vae_debug_bert_gpt \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + 
--eval_data_file=$TEST_FILE \ + --num_train_epochs 2.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 1 \ + --gloabl_step_eval 60000 \ + --block_size 128 + # --use_pretrained_model diff --git a/Optimus/code/scripts/scripts_local/run_maksed_lm_bert.sh b/Optimus/code/scripts/scripts_local/run_maksed_lm_bert.sh new file mode 100755 index 0000000000000000000000000000000000000000..6c5ba8dca1f82bbc3f0a64b9a7d96b2f69780ae7 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_maksed_lm_bert.sh @@ -0,0 +1,29 @@ + +export TRAIN_FILE=../data/datasets/wikitext-2/train.txt +export TEST_FILE=../data/datasets/wikitext-2/valid.txt +export GPU_ID=0,1 + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/run_lm_finetuning.py \ +# --output_dir=../output/local_mlm_wiki2 \ +# --model_type=roberta \ +# --model_name_or_path=roberta-base \ +# --do_train \ +# --train_data_file=$TRAIN_FILE \ +# --do_eval \ +# --eval_data_file=$TEST_FILE \ +# --mlm \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=2 + + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/run_lm_finetuning.py \ + --output_dir=../output/local_mlm_wiki2 \ + --model_type=bert \ + --model_name_or_path=bert-base-uncased \ + --do_train \ + --train_data_file=$TRAIN_FILE \ + --do_eval \ + --eval_data_file=$TEST_FILE \ + --mlm \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=2 \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_vae_glue.sh b/Optimus/code/scripts/scripts_local/run_vae_glue.sh new file mode 100755 index 0000000000000000000000000000000000000000..cd3f26f54295bc3021cb9f6045272d8558d82408 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_vae_glue.sh @@ -0,0 +1,30 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + + +export GLUE_DIR=/workspace/data/datasets/glue_data/glue_data +export TASK_NAME=YELP # SST-2 # CoLA # SST-2 # MRPC + +python ./examples/run_glue_vae.py \ + --model_type bert \ + --checkpoint_dir ../output/philly_rr1_vae_wikipedia_pretraining_b0.0_d1.0_r01.0_ra0.1 \ + --gloabl_step_eval 60000 \ + --model_name_or_path bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --do_lower_case \ + --save_steps 200 \ + --data_dir $GLUE_DIR/$TASK_NAME \ + --max_seq_length 128 \ + --per_gpu_eval_batch_size=32 \ + --per_gpu_train_batch_size=32 \ + --learning_rate 2e-5 \ + --num_train_epochs 50.0 \ + --percentage_per_label .5 \ + --sample_per_label 10000 \ + --output_dir ../output/vae_$TASK_NAME/ \ + --use_freeze \ + --overwrite_output_dir + + \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_local/run_vae_pretraining.sh b/Optimus/code/scripts/scripts_local/run_vae_pretraining.sh new file mode 100755 index 0000000000000000000000000000000000000000..b9f5d38d76e47fbdabaac8a6c8968a6c803228ba --- /dev/null +++ b/Optimus/code/scripts/scripts_local/run_vae_pretraining.sh @@ -0,0 +1,44 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + +export TRAIN_FILE=../data/datasets/wikipedia_json_64/ + +# CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_pretraining.py \ +# --output_dir=../output/local_lm_vae_wikipedia_pretraining \ +# --dataset wikipedia \ +# --encoder_model_type=bert \ +# --encoder_model_name_or_path=bert-base-cased \ +# --decoder_model_type=gpt2 \ +# --decoder_model_name_or_path=gpt2 \ +# --beta 0.0 \ +# --ratio_zero 1.0 \ +# --ratio_increase 0.1 \ +# --do_train \ +# --fb_mode 1 \ +# --train_data_file=$TRAIN_FILE \ +# --num_train_epochs 1.0 \ +# 
--save_steps 10000 \ +# --logging_steps 1000 \ +# --overwrite_output_dir \ +# --per_gpu_train_batch_size=8 \ +# --block_size 256 + +CUDA_VISIBLE_DEVICES=$GPU_ID python -m torch.distributed.launch --nproc_per_node 2 examples/big_ae/run_lm_vae_pretraining_distributed.py \ + --output_dir=../output/local_lm_vae_wikipedia_pretraining \ + --dataset wikipedia \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --beta 0.0 \ + --ratio_zero 1.0 \ + --ratio_increase 0.1 \ + --do_train \ + --fb_mode 1 \ + --train_data_file=$TRAIN_FILE \ + --num_train_epochs 1.0 \ + --save_steps 10000 \ + --logging_steps 1000 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=8 \ + --block_size 256 diff --git a/Optimus/code/scripts/scripts_local/save_init_bert_gpt2.sh b/Optimus/code/scripts/scripts_local/save_init_bert_gpt2.sh new file mode 100755 index 0000000000000000000000000000000000000000..a9462831f393e52aeff3896be920ea3b5a0f6603 --- /dev/null +++ b/Optimus/code/scripts/scripts_local/save_init_bert_gpt2.sh @@ -0,0 +1,33 @@ +export PYTHONPATH="${PYTHONPATH}:/workspace/code" + +export TRAIN_FILE=../data/datasets/penn/train.txt +export TEST_FILE=../data/datasets/penn/test.txt +export GPU_ID=0,1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --dataset Penn \ + --checkpoint_dir=../output/local_bert_gpt_init \ + --output_dir=../output/local_bert_gpt_init \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --do_train \ + --do_eval \ + --beta 1.0 \ + --ratio_zero .5 \ + --ratio_increase 0.25 \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 2.0 \ + --save_steps 20 \ + --logging_steps 4 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size 1 \ + --gloabl_step_eval 60000 \ + --block_size 128 \ + --latent_as_gpt_emb 1 \ + --latent_as_gpt_memory 1 \ + --save_bert_gpt_init \ + --latent_size 768 + # --use_pretrained_model diff --git a/Optimus/code/scripts/scripts_philly/eval_mlm_wiki2.yaml b/Optimus/code/scripts/scripts_philly/eval_mlm_wiki2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..f56495edf33bcf2aa1678f330bdafda8bfaf67d7 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/eval_mlm_wiki2.yaml @@ -0,0 +1,47 @@ +description: Evaluate VAE LM on Wiki2 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc6 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. 
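+  # Note: the search grid at the bottom of this file sweeps a single value (bs_option = 1),
+  # so only one evaluation job is launched even though max_trials is set to 20.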
+ # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . + - python examples/big_ae/run_lm_vae_pretraining.py --output_dir ../output/local_lm_vae_wiki2_bert_gpt --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --per_gpu_train_batch_size {bs_option} + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [1] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/eval_vae_wiki2.yaml b/Optimus/code/scripts/scripts_philly/eval_vae_wiki2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..689e986590540661062460d7a81635bc4aceaf65 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/eval_vae_wiki2.yaml @@ -0,0 +1,51 @@ +description: Evaluate VAE on Wiki2 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
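+    # Illustrative expansion of the templated command below, assuming beta_option=0.25 and
+    # bs_option=1: "... --beta 0.25 --per_gpu_train_batch_size 1
+    #   --output_dir ../output/philly_clm_wiki2_0.25 ...".
+    # Note that --per_gpu_train_batch_size appears twice in the template (once as {bs_option},
+    # once hard-coded to 1); argparse keeps the last occurrence, so the jobs effectively run
+    # with batch size 1 per GPU.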
+ - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta {beta_option} --per_gpu_train_batch_size {bs_option} --output_dir ../output/philly_clm_wiki2_{beta_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps 200 --logging_steps 100 + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [1] # [top,bottom] + - name: beta_option + spec: discrete + values: [0.0,0.25,0.5,0.75,1.0] # [top,bottom] \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_philly/eval_vae_yahoo.yaml b/Optimus/code/scripts/scripts_philly/eval_vae_yahoo.yaml new file mode 100755 index 0000000000000000000000000000000000000000..18a1bd93b19ce2291bd1044a3f9a1c836cfcffea --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/eval_vae_yahoo.yaml @@ -0,0 +1,61 @@ +description: Evaluate VAE on Yahoo Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabspvc11 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G2 # G4 # G1 + command: + - pip install --user --editable . 
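+    # Grid size for this sweep (see params below): 1 batch size x 2 beta values x
+    # 5 dim_target_kl values x 1 ratio_zero x 1 ratio_increase = 10 jobs, well within
+    # max_trials 50.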
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Yahoo --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 512 --output_dir ../output/philly_vae_yahoo_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/yahoo_data/train.txt --do_eval --eval_data_file ../data/datasets/yahoo_data/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 --gloabl_step_eval 6250 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [0.25,1.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.01,0.05,0.25,0.5,1.0] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/eval_vae_yelp.yaml b/Optimus/code/scripts/scripts_philly/eval_vae_yelp.yaml new file mode 100755 index 0000000000000000000000000000000000000000..a112a985c919bbb6ba6a7baa0d47b9508365ffa1 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/eval_vae_yelp.yaml @@ -0,0 +1,67 @@ +description: Evaluate VAE on Yelp Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G1 # G4 # G1 + command: + - pip install --user --editable . 
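+    # Evaluation-only sweep: the command below passes --do_eval (no --do_train) and a fixed
+    # --gloabl_step_eval 8334, so each job scores an existing Yelp checkpoint; the output_dir
+    # pattern mirrors the names used by the corresponding training runs (presumably so the
+    # saved checkpoints can be located).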
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta {beta_option} --dim_target_kl {dim_target_kl_option} --gloabl_step_eval 8334 --dataset Yelp --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --output_dir ../output/philly_vae_yelp_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/yelp_data/train.txt --do_eval --eval_data_file ../data/datasets/yelp_data/test.txt --overwrite_output_dir --save_steps 1000 --logging_steps 100 + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # [top,bottom] + # - name: beta_option + # spec: discrete + # values: [0.25,1.0] # [top,bottom] + # - name: dim_target_kl_option + # spec: discrete + # values: [0.5,1.0,2.0] # [top,bottom] + - name: beta_option + spec: discrete + values: [0.25,1.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.01,0.05,0.1,0.25] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2/philly.yaml b/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..427853efa883f95124ce6091844caa095801d4bf --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2/philly.yaml @@ -0,0 +1,66 @@ +version: 4.1.8 +dry_run: false +exp_name: eval_mlm_wiki2 +description: Evaluate VAE LM on Wiki2 Dataset +timestamp: '2019-09-26T00:16:52.178691-07:00' +auth: + cluster: et1 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 1 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/big_ae/run_lm_vae_pretraining.py --output_dir ../output/local_lm_vae_wiki2_bert_gpt + --model_type bert --model_name_or_path bert-base-uncased --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir + --per_gpu_train_batch_size {bs_option} + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_eval_mlm_wiki2_1_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - python examples/big_ae/run_lm_vae_pretraining.py --output_dir ../output/local_lm_vae_wiki2_bert_gpt + --model_type bert --model_name_or_path bert-base-uncased --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir + --per_gpu_train_batch_size 1 + id: application_1568929298048_1911 + results_dir: /mnt/_output/pt-results/2019-09-26/application_1568929298048_1911 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + container_name: bigtextae + storage_account_name: textae + mount_path: /mnt/_default + use_phillyfs: false + _output: + container_name: bigtextae + storage_account_name: textae + mount_path: /mnt/_output + use_phillyfs: false diff --git a/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2_resrchprojvc6/philly.yaml b/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2_resrchprojvc6/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..844d64a2c126115fb48fd3013108007074f57fbb --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/eval_mlm_wiki2_resrchprojvc6/philly.yaml @@ -0,0 +1,68 @@ +version: 4.1.8 +dry_run: false +exp_name: eval_mlm_wiki2_resrchprojvc6 +description: Evaluate VAE LM on Wiki2 Dataset +timestamp: '2019-09-26T17:02:02.480204-07:00' +auth: + cluster: rr1 + vc: resrchprojvc6 + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 1 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/big_ae/run_lm_vae_pretraining.py --output_dir ../output/local_lm_vae_wiki2_bert_gpt + --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir + --per_gpu_train_batch_size {bs_option} + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_eval_mlm_wiki2_resrchprojvc6_1_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - python examples/big_ae/run_lm_vae_pretraining.py --output_dir ../output/local_lm_vae_wiki2_bert_gpt + --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir + --per_gpu_train_batch_size 1 + id: application_1569487816036_0398 + results_dir: /mnt/_output/pt-results/2019-09-26/application_1569487816036_0398 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + use_phillyfs: false + container_name: bigtextae + storage_account_name: textae + mount_path: /mnt/_default + _output: + use_phillyfs: false + container_name: bigtextae + storage_account_name: textae + mount_path: /mnt/_output diff --git a/Optimus/code/scripts/scripts_philly/results/eval_vae_wiki2_beta/philly.yaml b/Optimus/code/scripts/scripts_philly/results/eval_vae_wiki2_beta/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..eedcc9afbce2a20d26dfea8aa5e3e4976dc981ae --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/eval_vae_wiki2_beta/philly.yaml @@ -0,0 +1,148 @@ +version: 4.1.8 +dry_run: false +exp_name: eval_vae_wiki2_beta +description: Evaluate VAE on Wiki2 Dataset +timestamp: '2019-09-28T09:58:48.231451-07:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 6 + - name: beta_option + spec: discrete + values: + - 0.0 + - 0.25 + - 0.5 + - 0.75 + - 1.0 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta {beta_option} + --per_gpu_train_batch_size {bs_option} --output_dir ../output/philly_clm_wiki2_{beta_option} + --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size + 1 --overwrite_output_dir --save_steps 200 --logging_steps 100 + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_eval_vae_wiki2_beta_6_b_0.00_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.0 --per_gpu_train_batch_size + 6 --output_dir ../output/philly_clm_wiki2_0.0 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps + 200 --logging_steps 100 + id: application_1568928610179_4519 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4519 + submit_args: {} + tags: [] + type: bash +- name: vq_eval_vae_wiki2_beta_6_b_0.50_abce + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.5 --per_gpu_train_batch_size + 6 --output_dir ../output/philly_clm_wiki2_0.5 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps + 200 --logging_steps 100 + id: application_1568928610179_4518 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4518 + submit_args: {} + tags: [] + type: bash +- name: vq_eval_vae_wiki2_beta_6_b_0.75_abcg + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.75 --per_gpu_train_batch_size + 6 --output_dir ../output/philly_clm_wiki2_0.75 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps + 200 --logging_steps 100 + id: application_1568928610179_4520 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4520 + submit_args: {} + tags: [] + type: bash +- name: vq_eval_vae_wiki2_beta_6_b_1.00_abch + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 1.0 --per_gpu_train_batch_size + 6 --output_dir ../output/philly_clm_wiki2_1.0 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps + 200 --logging_steps 100 + id: application_1568928610179_4522 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4522 + submit_args: {} + tags: [] + type: bash +- name: vq_eval_vae_wiki2_beta_6_b_0.25_abcf + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.25 --per_gpu_train_batch_size + 6 --output_dir ../output/philly_clm_wiki2_0.25 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size 1 --overwrite_output_dir --save_steps + 200 --logging_steps 100 + id: application_1568928610179_4521 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4521 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + mount_path: /mnt/_default + storage_account_name: textae + use_phillyfs: false + container_name: bigtextae + _output: + mount_path: /mnt/_output + storage_account_name: textae + use_phillyfs: false + container_name: bigtextae diff --git a/Optimus/code/scripts/scripts_philly/results/train_clm_wiki103/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_clm_wiki103/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..a398d07bc45fec7258f6ae421fff96d97b44b763 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_clm_wiki103/philly.yaml @@ -0,0 +1,66 @@ +version: 4.1.8 +dry_run: false +exp_name: train_clm_wiki103 +description: Train AE on Wiki 103 Dataset +timestamp: '2019-09-21T16:04:04.022070-07:00' +auth: + cluster: cam + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 2 + job_template: + name: vq_{experiment_name:s}_{bs_option:.1f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki103 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-103/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-103/valid.txt --per_gpu_train_batch_size + {bs_option} --save_steps 500 --overwrite_output_dir + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_train_clm_wiki103_2.0_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki103 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-103/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-103/valid.txt --per_gpu_train_batch_size + 2 --save_steps 500 --overwrite_output_dir + id: application_1569000762026_0384 + results_dir: /mnt/_output/pt-results/2019-09-21/application_1569000762026_0384 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + use_phillyfs: false + mount_path: /mnt/_default + container_name: bigtextae + storage_account_name: textae + _output: + use_phillyfs: false + mount_path: /mnt/_output + container_name: bigtextae + storage_account_name: textae diff --git a/Optimus/code/scripts/scripts_philly/results/train_clm_wiki2/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_clm_wiki2/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..d3a2538cb1a599549e18999bd588c6fe63acef18 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_clm_wiki2/philly.yaml @@ -0,0 +1,70 @@ +version: 4.1.8 +dry_run: false +exp_name: train_clm_wiki2 +description: Train AE on Wiki2 Dataset +timestamp: '2019-09-20T14:43:52.068058-07:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 2 + job_template: + name: vq_{experiment_name:s}_{bs_option:.1f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - export TRAIN_FILE=../data/datasets/wikitext-2/train.txt + - export TEST_FILE=../data/datasets/wikitext-2/valid.txt + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki2 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size + {bs_option} + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_train_clm_wiki2_2.0_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
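+    # The two export lines below are effectively unused for this job: the python command
+    # passes --train_data_file and --eval_data_file explicitly, so $TRAIN_FILE and $TEST_FILE
+    # are never read.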
+ - export TRAIN_FILE=../data/datasets/wikitext-2/train.txt + - export TEST_FILE=../data/datasets/wikitext-2/valid.txt + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki2 --model_type + gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size + 2 + id: application_1568928610179_0407 + results_dir: /mnt/_output/pt-results/2019-09-20/application_1568928610179_0407 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + storage_account_name: textae + mount_path: /mnt/_default + container_name: bigtextae + use_phillyfs: false + _output: + storage_account_name: textae + mount_path: /mnt/_output + container_name: bigtextae + use_phillyfs: false diff --git a/Optimus/code/scripts/scripts_philly/results/train_clm_yahoo/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_clm_yahoo/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..a9c0abcf3d05c6e4d5014fde2ee2ebf65ada6b51 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_clm_yahoo/philly.yaml @@ -0,0 +1,66 @@ +version: 4.1.8 +dry_run: false +exp_name: train_clm_yahoo +description: Train Causal on Yahoo Dataset +timestamp: '2019-09-20T21:58:13.558693-07:00' +auth: + cluster: cam + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 1 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_yahoo_gpt2 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yahoo_data/train.txt + --do_eval --eval_data_file ../data/datasets/yahoo_data/valid.txt --per_gpu_train_batch_size + {bs_option} + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_train_clm_yahoo_1_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_yahoo_gpt2 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yahoo_data/train.txt + --do_eval --eval_data_file ../data/datasets/yahoo_data/valid.txt --per_gpu_train_batch_size + 1 + id: application_1569000762026_0092 + results_dir: /mnt/_output/pt-results/2019-09-20/application_1569000762026_0092 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + mount_path: /mnt/_default + container_name: bigtextae + use_phillyfs: false + storage_account_name: textae + _output: + mount_path: /mnt/_output + container_name: bigtextae + use_phillyfs: false + storage_account_name: textae diff --git a/Optimus/code/scripts/scripts_philly/results/train_clm_yelp/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_clm_yelp/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..301a4fb69bdd37678f3b4e0d5181919d535c0b0e --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_clm_yelp/philly.yaml @@ -0,0 +1,66 @@ +version: 4.1.8 +dry_run: false +exp_name: train_clm_yelp +description: Train Causal LM on Yelp Dataset +timestamp: '2019-09-20T23:20:36.985656-07:00' +auth: + cluster: cam + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 1 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_yelp_gpt2 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yelp_data/train.txt + --do_eval --eval_data_file ../data/datasets/yelp_data/valid.txt --per_gpu_train_batch_size + {bs_option} --overwrite_output_dir + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_train_clm_yelp_1_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_yelp_gpt2 + --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yelp_data/train.txt + --do_eval --eval_data_file ../data/datasets/yelp_data/valid.txt --per_gpu_train_batch_size + 1 --overwrite_output_dir + id: application_1569000762026_0101 + results_dir: /mnt/_output/pt-results/2019-09-20/application_1569000762026_0101 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + use_phillyfs: false + storage_account_name: textae + mount_path: /mnt/_default + container_name: bigtextae + _output: + use_phillyfs: false + storage_account_name: textae + mount_path: /mnt/_output + container_name: bigtextae diff --git a/Optimus/code/scripts/scripts_philly/results/train_mlm_wiki2/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_mlm_wiki2/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..1a16f8069b2effbb0818cd81bf52f8fd75b5a921 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_mlm_wiki2/philly.yaml @@ -0,0 +1,66 @@ +version: 4.1.8 +dry_run: false +exp_name: train_mlm_wiki2 +description: Train Masked LM on Wiki2 Dataset +timestamp: '2019-09-20T16:50:00.802031-07:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 4 + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_mlm_wiki2 + --model_type roberta --model_name_or_path roberta-base --do_train --train_data_file + ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt + --per_gpu_train_batch_size {bs_option} --mlm + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_train_mlm_wiki2_4_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
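+    # The --mlm flag in the command below switches run_lm_finetuning.py to the masked
+    # language modeling objective, which is required when fine-tuning encoder-style models
+    # such as roberta-base (the default objective is causal LM for GPT-2-style models).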
+ - python examples/run_lm_finetuning.py --output_dir ../output/philly_mlm_wiki2 --model_type + roberta --model_name_or_path roberta-base --do_train --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size + 4 --mlm + id: application_1568928610179_0426 + results_dir: /mnt/_output/pt-results/2019-09-20/application_1568928610179_0426 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + storage_account_name: textae + mount_path: /mnt/_default + use_phillyfs: false + container_name: bigtextae + _output: + storage_account_name: textae + mount_path: /mnt/_output + use_phillyfs: false + container_name: bigtextae diff --git a/Optimus/code/scripts/scripts_philly/results/train_vae_penn/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_vae_penn/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..76d6173e5777c1c09794ef15304d0e5be3240111 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_vae_penn/philly.yaml @@ -0,0 +1,95 @@ +version: 4.1.8 +dry_run: false +exp_name: train_vae_penn +description: Train VAE on PTB Dataset +timestamp: '2020-03-31T07:28:00.282060+00:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /data/home/chunyl/azure_mounts/optimus_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 50 + params: + - name: bs_option + spec: discrete + values: + - 4 + - name: beta_option + spec: discrete + values: + - 0.0 + - name: dim_target_kl_option + spec: discrete + values: + - 0.1 + - name: ratio_zero_option + spec: discrete + values: + - 0.5 + - name: ratio_increase_option + spec: discrete + values: + - 0.25 + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero + {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Penn + --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size + 100 --output_dir ../output/LM/Penn/philly_vae_penn_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} + --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt + --do_eval --eval_data_file ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps + 2000 --logging_steps 100 + submit_args: {} + tags: [] + type: bash +jobs: +- name: exp_train_vae_penn_b4_beta_0.00_d_0.10_r0_0.50_ra_0.25_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
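+    # (Assumed semantics of the annealing flags in the command below) beta follows a cyclical
+    # schedule: --ratio_zero is presumably the fraction of each cycle with beta held at 0 and
+    # --ratio_increase the fraction spent ramping linearly toward --beta, with the remainder
+    # held at the maximum value.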
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 0.1 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/LM/Penn/philly_vae_penn_b0.0_d0.1_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 + id: application_1583307153868_8818 + results_dir: /mnt/_output/pt-results/2020-03-31/application_1583307153868_8818 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + mount_path: /mnt/_default + container_name: optimus + use_phillyfs: false + storage_account_name: textae + _output: + mount_path: /mnt/_output + container_name: optimus + use_phillyfs: false + storage_account_name: textae diff --git a/Optimus/code/scripts/scripts_philly/results/train_vae_penn_ft/philly.yaml b/Optimus/code/scripts/scripts_philly/results/train_vae_penn_ft/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..e9327cc20ef9ec59bcadb075fbdde3a5e2a7b12c --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/train_vae_penn_ft/philly.yaml @@ -0,0 +1,186 @@ +version: 4.1.8 +dry_run: false +exp_name: train_vae_penn_ft +description: Train VAE on PTB Dataset +timestamp: '2020-04-03T21:35:54.650274+00:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /data/home/chunyl/azure_mounts/optimus_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 50 + params: + - name: bs_option + spec: discrete + values: + - 4 + - name: beta_option + spec: discrete + values: + - 0.0 + - name: dim_target_kl_option + spec: discrete + values: + - 0.05 + - 0.1 + - 0.25 + - 0.5 + - 1 + - name: ratio_zero_option + spec: discrete + values: + - 0.5 + - name: ratio_increase_option + spec: discrete + values: + - 0.25 + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
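+    # Fine-tuning from a pre-trained checkpoint: the templated command below passes
+    # --use_pretrained_model and --use_pretrained_vae so that encoder and decoder are
+    # initialized from the wikipedia-pretrained Optimus weights under --checkpoint_dir,
+    # taking the checkpoint at step 200000 selected by --gloabl_step_eval.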
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero + {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Penn + --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size + 100 --output_dir ../output/philly_vae_penn_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} + --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt + --do_eval --eval_data_file ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps + 2000 --logging_steps 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir + ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + submit_args: {} + tags: [] + type: bash +jobs: +- name: exp_train_vae_penn_ft_b4_beta_0.00_d_0.50_r0_0.50_ra_0.25_abch + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 0.5 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/philly_vae_penn_b0.0_d0.5_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + id: application_1583307153868_10015 + results_dir: /mnt/_output/pt-results/2020-04-03/application_1583307153868_10015 + submit_args: {} + tags: [] + type: bash +- name: exp_train_vae_penn_ft_b4_beta_0.00_d_0.25_r0_0.50_ra_0.25_abce + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 0.25 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/philly_vae_penn_b0.0_d0.25_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + id: application_1583307153868_10014 + results_dir: /mnt/_output/pt-results/2020-04-03/application_1583307153868_10014 + submit_args: {} + tags: [] + type: bash +- name: exp_train_vae_penn_ft_b4_beta_0.00_d_0.10_r0_0.50_ra_0.25_abcg + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 0.1 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/philly_vae_penn_b0.0_d0.1_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + id: application_1583307153868_10017 + results_dir: /mnt/_output/pt-results/2020-04-03/application_1583307153868_10017 + submit_args: {} + tags: [] + type: bash +- name: exp_train_vae_penn_ft_b4_beta_0.00_d_0.05_r0_0.50_ra_0.25_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 0.05 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/philly_vae_penn_b0.0_d0.05_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + id: application_1583307153868_10018 + results_dir: /mnt/_output/pt-results/2020-04-03/application_1583307153868_10018 + submit_args: {} + tags: [] + type: bash +- name: exp_train_vae_penn_ft_b4_beta_0.00_d_1.00_r0_0.50_ra_0.25_abcf + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs + 1.0 --beta 0.0 --dim_target_kl 1 --ratio_zero 0.5 --ratio_increase 0.25 --dataset + Penn --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 1 --block_size 100 + --output_dir ../output/philly_vae_penn_b0.0_d1_r00.5_ra0.25 --encoder_model_type + bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path + gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file + ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps + 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 + --gloabl_step_eval 200000 + id: application_1583307153868_10016 + results_dir: /mnt/_output/pt-results/2020-04-03/application_1583307153868_10016 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + mount_path: /mnt/_default + storage_account_name: textae + container_name: optimus + use_phillyfs: false + _output: + mount_path: /mnt/_output + storage_account_name: textae + container_name: optimus + use_phillyfs: false diff --git a/Optimus/code/scripts/scripts_philly/results/vae_wiki2_beta/philly.yaml b/Optimus/code/scripts/scripts_philly/results/vae_wiki2_beta/philly.yaml new file mode 100755 index 0000000000000000000000000000000000000000..1ec3ee0801723795622da8984b1590d4823c9263 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/results/vae_wiki2_beta/philly.yaml @@ -0,0 +1,148 @@ +version: 4.1.8 +dry_run: false +exp_name: vae_wiki2_beta +description: Train VAE on Wiki2 Dataset +timestamp: '2019-09-28T00:40:47.673194-07:00' +auth: + cluster: eu2 + vc: msrlabs + docker: + registry: index.docker.io + image: chunyl/pytorch-transformers:v0 +code: + local_dir: /home/chunyl/azure_mounts/textae_azure/code/scripts/scripts_philly/code + remote_dir: code/ + code_zip: false + storage_id: _default +data: + storage_id: _default +search: + type: grid + max_trials: 20 + params: + - name: bs_option + spec: discrete + values: + - 4 + - name: beta_option + spec: discrete + values: + - 0.0 + - 0.25 + - 0.5 + - 0.75 + - 1.0 + job_template: + name: 
vq_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta {beta_option} + --per_gpu_train_batch_size {bs_option} --output_dir ../output/philly_clm_wiki2_{beta_option} + --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type + gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-2/train.txt + --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps + 200 --logging_steps 100 + submit_args: {} + tags: [] + type: bash +jobs: +- name: vq_vae_wiki2_beta_4_b_1.00_abcg + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 1.0 --per_gpu_train_batch_size + 4 --output_dir ../output/philly_clm_wiki2_1.0 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 200 + --logging_steps 100 + id: application_1568928610179_4442 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4442 + submit_args: {} + tags: [] + type: bash +- name: vq_vae_wiki2_beta_4_b_0.50_abch + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.5 --per_gpu_train_batch_size + 4 --output_dir ../output/philly_clm_wiki2_0.5 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 200 + --logging_steps 100 + id: application_1568928610179_4444 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4444 + submit_args: {} + tags: [] + type: bash +- name: vq_vae_wiki2_beta_4_b_0.25_abcd + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.25 --per_gpu_train_batch_size + 4 --output_dir ../output/philly_clm_wiki2_0.25 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 200 + --logging_steps 100 + id: application_1568928610179_4443 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4443 + submit_args: {} + tags: [] + type: bash +- name: vq_vae_wiki2_beta_4_b_0.75_abce + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . 
+ - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.75 --per_gpu_train_batch_size + 4 --output_dir ../output/philly_clm_wiki2_0.75 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 200 + --logging_steps 100 + id: application_1568928610179_4445 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4445 + submit_args: {} + tags: [] + type: bash +- name: vq_vae_wiki2_beta_4_b_0.00_abcf + sku: G4 + sku_count: 1 + command: + - pip install --user --editable . + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta 0.0 --per_gpu_train_batch_size + 4 --output_dir ../output/philly_clm_wiki2_0.0 --encoder_model_type bert --encoder_model_name_or_path + bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 + --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file + ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 200 + --logging_steps 100 + id: application_1568928610179_4446 + results_dir: /mnt/_output/pt-results/2019-09-28/application_1568928610179_4446 + submit_args: {} + tags: [] + type: bash +storage: + info: + _default: + container_name: bigtextae + use_phillyfs: false + mount_path: /mnt/_default + storage_account_name: textae + _output: + container_name: bigtextae + use_phillyfs: false + mount_path: /mnt/_output + storage_account_name: textae diff --git a/Optimus/code/scripts/scripts_philly/train_clm_snli.yaml b/Optimus/code/scripts/scripts_philly/train_clm_snli.yaml new file mode 100755 index 0000000000000000000000000000000000000000..6fbdcea5b75a01d2cf62e4f38fc8d5b4c6671898 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_clm_snli.yaml @@ -0,0 +1,47 @@ +description: Train Causal LM on Snli Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6 etc.). Everyone has access to "pnrsy". + vc: resrchprojvc6 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr1, rr2, eu2, eu1 et1 + cluster: rr1 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: gpt2_{experiment_name:s}_{bs_option:.0f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
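+    # Baseline job: the command below fine-tunes plain GPT-2 via run_lm_finetuning_baseline.py
+    # (no latent variable) on SNLI for 20 epochs with block size 100; sku G4 presumably
+    # requests a 4-GPU node, so bs_option is the per-GPU batch size.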
+ - python examples/big_ae/run_lm_finetuning_baseline.py --output_dir ../output/philly_clm_snli_20epoch_gpt2 --num_train_epochs 20.0 --dataset Snli --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/snli_data/train.txt --do_eval --eval_data_file ../data/datasets/snli_data/test.txt --per_gpu_train_batch_size {bs_option} --block_size 100 --overwrite_output_dir + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_clm_wiki103.yaml b/Optimus/code/scripts/scripts_philly/train_clm_wiki103.yaml new file mode 100755 index 0000000000000000000000000000000000000000..dde389c4a04eb15dc311e1a7b8d4783a8b62eaa3 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_clm_wiki103.yaml @@ -0,0 +1,47 @@ +description: Train AE on Wiki 103 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: cam + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.1f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki103 --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-103/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-103/valid.txt --per_gpu_train_batch_size {bs_option} --save_steps 500 --overwrite_output_dir + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [2] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_clm_wiki2.yaml b/Optimus/code/scripts/scripts_philly/train_clm_wiki2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..7f692da28598998c6e7d2c45730bed596e215a1a --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_clm_wiki2.yaml @@ -0,0 +1,49 @@ +description: Train AE on Wiki2 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) 
+ docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.1f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . + - export TRAIN_FILE=../data/datasets/wikitext-2/train.txt + - export TEST_FILE=../data/datasets/wikitext-2/valid.txt + - python examples/run_lm_finetuning.py --output_dir ../output/philly_clm_wiki2 --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size {bs_option} + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [2] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_clm_yahoo.yaml b/Optimus/code/scripts/scripts_philly/train_clm_yahoo.yaml new file mode 100755 index 0000000000000000000000000000000000000000..06cd12b9eae4c2fae906d6f27335cd13502fbdc7 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_clm_yahoo.yaml @@ -0,0 +1,47 @@ +description: Train Causal on Yahoo Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc6 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 # eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: gpt2_{experiment_name:s}_{bs_option:.0f} + sku: G8 # G4 # G1 + command: + - pip install --user --editable . 
+ - python examples/big_ae/run_lm_finetuning_baseline.py --output_dir ../output/philly_clm_yahoo_gpt2 --dataset Yahoo --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yahoo_data/train.txt --do_eval --eval_data_file ../data/datasets/yahoo_data/valid.txt --overwrite_output_dir --per_gpu_train_batch_size {bs_option} + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_clm_yelp.yaml b/Optimus/code/scripts/scripts_philly/train_clm_yelp.yaml new file mode 100755 index 0000000000000000000000000000000000000000..bb50fde9fa57c156d6ef61f454b215973e40ba67 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_clm_yelp.yaml @@ -0,0 +1,47 @@ +description: Train Causal LM on Yelp Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc6 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 # eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: gpt2_{experiment_name:s}_{bs_option:.0f} + sku: G8 # G4 # G1 + command: + - pip install --user --editable . + - python examples/big_ae/run_lm_finetuning_baseline.py --output_dir ../output/philly_clm_yelp_gpt2 --dataset Yelp --model_type gpt2 --model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yelp_data/train.txt --do_eval --eval_data_file ../data/datasets/yelp_data/test.txt --per_gpu_train_batch_size {bs_option} --overwrite_output_dir + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [3] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_mlm_wiki2.yaml b/Optimus/code/scripts/scripts_philly/train_mlm_wiki2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..d58f6323cc274b5ae3a1c680e8e1665ee0f9e28c --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_mlm_wiki2.yaml @@ -0,0 +1,47 @@ +description: Train Masked LM on Wiki2 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) 
+ docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . + - python examples/run_lm_finetuning.py --output_dir ../output/philly_mlm_wiki2 --model_type roberta --model_name_or_path roberta-base --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --per_gpu_train_batch_size {bs_option} --mlm + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # [top,bottom] diff --git a/Optimus/code/scripts/scripts_philly/train_vae_penn.yaml b/Optimus/code/scripts/scripts_philly/train_vae_penn.yaml new file mode 100755 index 0000000000000000000000000000000000000000..52f4f8e2c6e1675a3e08b4cc3f62c555f3c68a8b --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_penn.yaml @@ -0,0 +1,61 @@ +description: Train VAE on PTB Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc21, msrlabspvc11, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: pytorch-transformers:v1 + # registry: chunylregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: optimus + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
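+ # The run_lm_vae_training.py command below sweeps the VAE fine-tuning hyper-parameters
+ # ({beta_option}, {dim_target_kl_option}, {ratio_zero_option}, {ratio_increase_option},
+ # {bs_option}); each placeholder is filled from the matching entry under params, so the
+ # single values listed there presumably yield a single job.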
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Penn --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 100 --output_dir ../output/LM/Penn/philly_vae_penn_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [0.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.1] # [0.01,0.05,0.1,0.25,0.5,1] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_penn_ft.yaml b/Optimus/code/scripts/scripts_philly/train_vae_penn_ft.yaml new file mode 100755 index 0000000000000000000000000000000000000000..604afec6b1a6faabf6b0ae3df22f4f236e4223ae --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_penn_ft.yaml @@ -0,0 +1,61 @@ +description: Train VAE on PTB Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, msrlabspvc11, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: pytorch-transformers:v1 + # registry: chunylregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: optimus + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
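+ # Unlike train_vae_penn.yaml, the command below resumes from a pre-trained Optimus
+ # checkpoint: note the --use_pretrained_model, --use_pretrained_vae, --checkpoint_dir
+ # and --gloabl_step_eval flags further down, and the wider dim_target_kl_option grid
+ # under params.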
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Penn --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 100 --output_dir ../output/philly_vae_penn_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/penn/train.txt --do_eval --eval_data_file ../data/datasets/penn/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/pretrain/philly_rr1_vc21_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32 --gloabl_step_eval 200000 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [0.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.05,0.1,0.25,0.5,1] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_snli.yaml b/Optimus/code/scripts/scripts_philly/train_vae_snli.yaml new file mode 100755 index 0000000000000000000000000000000000000000..9a5d20a61ee58469676e25d401fdddacd03ddf1f --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_snli.yaml @@ -0,0 +1,61 @@ +description: Train VAE on SNLI Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 20.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Snli --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 100 --output_dir ../output/philly_vae_snli_epoch20_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/snli_data/train.txt --do_eval --eval_data_file ../data/datasets/snli_data/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [10] # + - name: beta_option + spec: discrete + values: [0.25,1.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.01,0.05,0.25,0.5,1] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_wiki2.yaml b/Optimus/code/scripts/scripts_philly/train_vae_wiki2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..fbbd1742581d4f5ef1eeff0736c17db227e24ca8 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_wiki2.yaml @@ -0,0 +1,51 @@ +description: Train VAE on Wiki2 Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc6 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{bs_option:.0f}_b_{beta_option:.2f} + sku: G8 # G4 # G1 + command: + - pip install --user --editable . 
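+ # The {bs_option} and {beta_option} placeholders in the job name and in the command
+ # below are filled from the grid params at the end of this file; with batch size [4]
+ # and betas [0.0, 0.25, 0.5, 0.75, 1.0] this presumably expands to five jobs, matching
+ # the vq_vae_wiki2_beta_4_b_* runs recorded in the results file earlier in this diff.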
+ - pip install --user tqdm + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --beta {beta_option} --per_gpu_train_batch_size {bs_option} --output_dir ../output/philly_clm_wiki2_{beta_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-uncased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikitext-2/train.txt --do_eval --eval_data_file ../data/datasets/wikitext-2/valid.txt --overwrite_output_dir --save_steps 400 --logging_steps 100 + max_trials: 20 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # [top,bottom] + - name: beta_option + spec: discrete + values: [0.0,0.25,0.5,0.75,1.0] # [top,bottom] \ No newline at end of file diff --git a/Optimus/code/scripts/scripts_philly/train_vae_wikipedia.yaml b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia.yaml new file mode 100755 index 0000000000000000000000000000000000000000..5ffe88c308f133bd975f5ae90f498d4e01d8fad6 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia.yaml @@ -0,0 +1,58 @@ +description: Train AE on Wikipedia Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v1 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. 
+ # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - python examples/big_ae/run_lm_vae_pretraining.py --use_philly --num_train_epochs 20.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset wikipedia --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 128 --output_dir ../output/philly_vae_wikipedia_pretraining_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikipedia_json --overwrite_output_dir --save_steps 10000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [0.0] # + - name: dim_target_kl_option + spec: discrete + values: [1.0] # + - name: ratio_zero_option + spec: discrete + values: [1.0] # + - name: ratio_increase_option + spec: discrete + values: [0.1] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed.yaml b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed.yaml new file mode 100755 index 0000000000000000000000000000000000000000..6892191af3d151ff57d38e02f9eba7fc63498056 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed.yaml @@ -0,0 +1,61 @@ +description: Distributed Train AE on Wikipedia Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc7 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G8 # G4 # G1 + command: + - pip install --user --editable . 
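+ # Distributed variant: the command below launches run_lm_vae_pretraining_distributed.py
+ # via torch.distributed.launch with --nproc_per_node 8, which appears to match the GPU
+ # count of the G8 sku requested above (the _eu2 variant of this file uses G4 with
+ # --nproc_per_node 4).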
+ - pip install --user azure + - pip install --user tqdm + - python -m torch.distributed.launch --nproc_per_node 8 examples/big_ae/run_lm_vae_pretraining_distributed.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset wikipedia --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 128 --output_dir ../output/philly_rr1_vae_wikipedia_pretraining_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikipedia_json_64/ --overwrite_output_dir --save_steps 20000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [16] # + - name: beta_option + spec: discrete + values: [0.0] # + - name: dim_target_kl_option + spec: discrete + values: [1.0] # + - name: ratio_zero_option + spec: discrete + values: [1.0] # + - name: ratio_increase_option + spec: discrete + values: [0.1] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed_eu2.yaml b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed_eu2.yaml new file mode 100755 index 0000000000000000000000000000000000000000..b4278c8a8ac7c7819048f9dd80b0814d7fe387da --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_wikipedia_distributed_eu2.yaml @@ -0,0 +1,61 @@ +description: Distributed Train AE on Wikipedia Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
+ - pip install --user azure + - pip install --user tqdm + - python -m torch.distributed.launch --nproc_per_node 4 examples/big_ae/run_lm_vae_pretraining_distributed.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset wikipedia --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 128 --output_dir ../output/philly_eu2_vae_wikipedia_pretraining_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/wikipedia_json_64/ --overwrite_output_dir --save_steps 20000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [12] # + - name: beta_option + spec: discrete + values: [0.0] # + - name: dim_target_kl_option + spec: discrete + values: [1.0] # + - name: ratio_zero_option + spec: discrete + values: [1.0] # + - name: ratio_increase_option + spec: discrete + values: [0.1] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_yahoo.yaml b/Optimus/code/scripts/scripts_philly/train_vae_yahoo.yaml new file mode 100755 index 0000000000000000000000000000000000000000..5e4281c80e66e68bdefc96fca207366b37282d1e --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_yahoo.yaml @@ -0,0 +1,61 @@ +description: Train VAE on Yahoo Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: msrlabspvc11 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Yahoo --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 512 --output_dir ../output/philly_vae_yahoo_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yahoo_data/train.txt --do_eval --eval_data_file ../data/datasets/yahoo_data/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [0.25,1.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.01,0.05,0.25,0.5,1.0] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/train_vae_yelp.yaml b/Optimus/code/scripts/scripts_philly/train_vae_yelp.yaml new file mode 100755 index 0000000000000000000000000000000000000000..dbe7904e7660839c2129185811e8293185263a79 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vae_yelp.yaml @@ -0,0 +1,77 @@ +description: Train VAE on Yelp Dataset + +auth: + # which virtual cluster you belong to (msrlabs, resrchprojvc6, etc.). Everyone has access to "pnrsy". + vc: resrchprojvc7 + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: rr1 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: chunyl/pytorch-transformers:v0 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: textae + container_name: bigtextae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. + # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: code/ + local_dir: $CONFIG_DIR/code + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: exp_{experiment_name:s}_b{bs_option:.0f}_beta_{beta_option:.2f}_d_{dim_target_kl_option:.2f}_r0_{ratio_zero_option:.2f}_ra_{ratio_increase_option:.2f} + sku: G4 # G4 # G1 + command: + - pip install --user --editable . 
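+ # As with the Penn fine-tuning config, the command below resumes from a pre-trained
+ # Optimus checkpoint (--use_pretrained_model / --checkpoint_dir / --gloabl_step_eval).
+ # The --block_size differs per dataset (100 for Penn/SNLI, 300 for Yelp, 512 for Yahoo),
+ # presumably reflecting typical example length.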
+ - pip install --user azure + - pip install --user tqdm + - python examples/big_ae/run_lm_vae_training.py --use_philly --num_train_epochs 1.0 --beta {beta_option} --dim_target_kl {dim_target_kl_option} --ratio_zero {ratio_zero_option} --ratio_increase {ratio_increase_option} --dataset Yelp --per_gpu_train_batch_size {bs_option} --per_gpu_eval_batch_size 1 --block_size 300 --output_dir ../output/philly_vae_yelp_b{beta_option}_d{dim_target_kl_option}_r0{ratio_zero_option}_ra{ratio_increase_option} --encoder_model_type bert --encoder_model_name_or_path bert-base-cased --decoder_model_type gpt2 --decoder_model_name_or_path gpt2 --do_train --train_data_file ../data/datasets/yelp_data/train.txt --do_eval --eval_data_file ../data/datasets/yelp_data/test.txt --overwrite_output_dir --save_steps 2000 --logging_steps 100 --use_pretrained_model --use_pretrained_vae --checkpoint_dir ../output/philly_rr3scl_g8_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25 --gloabl_step_eval 760000 + max_trials: 50 + type: grid + params: + - name: bs_option + spec: discrete + values: [4] # + - name: beta_option + spec: discrete + values: [1.0] # + - name: dim_target_kl_option + spec: discrete + values: [0.01,0.05,0.25,0.5,1.0] # + - name: ratio_zero_option + spec: discrete + values: [0.5] # + - name: ratio_increase_option + spec: discrete + values: [0.25] # + + # - name: bs_option + # spec: discrete + # values: [4] # + # - name: beta_option + # spec: discrete + # values: [0.25,1.0] # + # - name: dim_target_kl_option + # spec: discrete + # values: [0.01,0.05,0.25,0.5,1] # + # - name: ratio_zero_option + # spec: discrete + # values: [0.5] # + # - name: ratio_increase_option + # spec: discrete + # values: [0.25] # diff --git a/Optimus/code/scripts/scripts_philly/train_vq_bird.yaml b/Optimus/code/scripts/scripts_philly/train_vq_bird.yaml new file mode 100755 index 0000000000000000000000000000000000000000..8762d0a5139c83852676e91a92da22ce3ec10502 --- /dev/null +++ b/Optimus/code/scripts/scripts_philly/train_vq_bird.yaml @@ -0,0 +1,46 @@ +description: Train VQ on Bird Dataset + +auth: + # which virtual cluster you belong to (msrlabs, etc.). Everyone has access to "pnrsy". + vc: msrlabspvc12 # msrlabs + # physical cluster to use (cam, gcr, rr1) or Azure clusters (eu1, eu2, etc.) + # cluster: rr2, eu2, eu1 et1 + cluster: eu2 + # docker environment (vm) in which your job will run. we provide "generic" dockers + # with the main deep learning toolkit installed (PyTorch, TF, Chainer, etc.) + docker: + # image: philly/jobs/custom/generic-docker:py27 + # registry: phillyregistry.azurecr.io + image: vlnres/vqvae:v1 # chunyl/vqvae:v2 + registry: index.docker.io + +storage: + _default: + #use_phillyfs: True + storage_account_name: sslm + container_name: vqvae + mount_path: /mnt/_default + +code: + # local directory of the code. this will be uploaded to the server. 
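+ # Note that this config targets a different codebase than the other files here:
+ # remote_dir/local_dir below point at the separate vq-vae-2-pytorch project, and the
+ # job appears to run train_vqvae.py on bird image data rather than the Optimus text models.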
+ # $CONFIG_DIR is expanded to the directory of this config file + code_upload: False + remote_dir: vq-vae-2-pytorch/ + local_dir: $CONFIG_DIR/src + +#data: + # data upload is not required for this example + #data_upload: False + +search: + job_template: + name: vq_{experiment_name:s}_{image_size_option:.1f} + sku: G4 # G4 # G1 + command: + - python train_vqvae.py --philly --dataset_name bird --size {image_size_option} --batch 512 + max_trials: 20 + type: grid + params: + - name: image_size_option + spec: discrete + values: [64,128] # [top,bottom] diff --git a/Optimus/code/twitter_prompts.csv b/Optimus/code/twitter_prompts.csv new file mode 100644 index 0000000000000000000000000000000000000000..569a05484f4782ecceb5c9a988fd7759aa6e9929 --- /dev/null +++ b/Optimus/code/twitter_prompts.csv @@ -0,0 +1,2088 @@ +,0 +0,Persephone +1,"A portrait: man, whose lineage is corpse." +2,a beautiful Waluigi +3,president abe lincoln but a cat +4,a woman and a crow +5,"A professional, minimalist poster for the book The Old Man and the Sea" +6,"half Ryan, half pigeon" +7,Easter cat +8,a beautiful woman +9,a cherry tree made of fractals +10,a christmas card from the victorian era +11,The Theotokos is a bird +12, +13,A short life full of immense joy +14,a character from a ghibli movie +15,A structure made of people standing on top of other people +16,зеленая собака +17,a painting of the city +18,a character from a ghibli movie +19,pasta ömetabolism +20,"a brilliant sketch titled ""Let Forever be Delayed""" +21,the sun is shining on the lake +22,Monet Lisa +23,Genesis +24,Synesthesia +25,A dead man +26,a cherry tree made of fractals +27,a tasteful nude +28,The First Supper +29,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +30,a painting of the last day +31,Dead Codes by Ryan Murdock +32,Genesis +33,symmetry +34,The OLD DATA +35,a beautiful person +36,the whitest man +37,Death is a black camel that kneels down so we can ride +38,a goblin by van gogh +39,a portrait of a beautiful person +40,a famous painted portrait of Lady Macbeth +41,on the edge of grace +42,"""A God Made of Wires and Dust"" by Ryan Murdock" +43,symmetry +44,a beautiful person +45,"If we're not careful, it's only art about not-quite-dead pigs from now on." +46,Beauty here -- a photograph by Ryan Murdock +47,Hunger art by r.j. Murdock +48,"A professional, minimalist poster for the film Donnie Darko" +49,A black and white photo of a rainbow. +50,a beautiful painting +51,Monet Lisa +52,a painting of the city +53,A structure made of people standing on top of other people +54,a criminal +55,a cherry tree made of fractals +56,Persephone flees Hades +57,a tree with weaping branches +58,a tree with weaping branches +59,Genesis +60,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +61,a cute cat +62,Aflame +63,A cat wearing a tophat +64,a terrifying night hag +65,a beautiful woman +66,Fire +67,a cherry tree made of fractals +68,The EcoCathedral +69,a man on fire +70,A structure made of people standing on top of other people +71,totemic dusk +72,The Death of Achilles +73,Everywhere is no-place +74,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +75,An Arundel Tomb +76,The average Advadnoun twitter follower +77,I can read when there's writing on the wall +78, +79,A Tragedy +80,Breathe deep the fumes at Delphi +81,a pOrTRaIT Of tHe SpOngeBOb CHicKen +82,a portrait of a beautiful person +83,a beautiful person +84,a portrait of a beautiful person +85,Dead Codes by Ryan Murdock +86,a photo of a purple dog +87,Memento Mori +88,"joy, happiness, bliss" +89,Paradise Lost +90,a beautiful person +91,melancholia +92,Monet Lisa +93,"Of that which one cannot speak, one must be silent." +94, +95,Juliet +96,God killed Van Gogh. +97,a cherry tree made of fractals +98,a horse with four eyes. +99,a beautiful person +100,With the Gods in envy of their visions +101,The Lost Generation +102,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +103,a portrait of a beautiful person +104,"half Ryan, half pigeon" +105,a ginormous baby +106,a wormhole +107,Ophelia +108,"""The hunger artist, full"" by Ryan Murdock" +109,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +110,"Intricate, Weeping Tree by Ryan Murdock" +111,everything was beautiful and nothing hurt +112,Saturn being a good dad to his son +113,The years gild our memoriesnUnfairly. +114,Intimations of Immortality +115,meaningless neko ♡♡ neko +116,chiaroscuro +117,The Patron Saint of Evil +118,a portrait of a beautiful person +119,"Mephisto, shrouded in smoke" +120,everything was beautiful and nothing hurt +121,God killed Van Gogh. +122,a man wearing makeup +123,Everywhere is no-place +124,🔴~__��'t � +125,a beautiful waluigi +126,a beautiful woman +127,a portrait of a beautiful person +128,/ +129,a green doG +130,Dead Codes by Ryan Murdock +131,I miss the Spring +132, +133,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +134,a woman and a crow +135,a photo from {my hometown} +136,Summer's Symphony: Counterpoint and Melody +137,a cute cat +138,"God, it's amazing." +139,a painting of a sycamore in +140,distinguished leaves decorated +141,I do not think they'll sing for me +142,the monet lisa +143,a portrait of Abraham Lincoln +144,The average Advadnoun twitter follower +145,Dancing in the moonlight +146,Shinji Ikari +147,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +148,/ +149,is this loss? but it's van gogh +150,Shinji Ikari +151,a portrait of Juliet +152,A sticky-note magnum opus featuring birds +153,a silent palace +154,"""A new hope blooms on the long notes of old horns.""" +155,The things I'll take with me +156,is this loss? but it's van gogh +157,a beautiful haunting +158,Summer's Symphony: Counterpoint and Melody +159,зеленая собака +160,Last Breath +161,Last Breath +162,a cherry tree made of fractals +163,The Theotokos is a bird +164,a man holding an apple in one hand +165,a beautiful person +166,Monet Lisa +167,A baroque portrait of Hamlet +168,A gun killed Van Gogh. 
+169,totemic dusk +170,a portrait of a beautiful person +171,pasta ömetabolism +172,a beautiful person +173,Taylor Swift +174,colorful rabbits chandelier polaroid +175,Dancing in the moonlight +176,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +177,symmetry +178,"""Your mind flls in the gaps"" - by Ryan Murdock" +179,the moon is a sickle cell +180,"joy, happiness, bliss" +181,Beauty here -- a photograph by Ryan Murdock +182,a beautiful person +183,a photo of a purple dog +184,A propaganda poster promoting big chungus +185,a beautiful person +186,a tree with weaping branches +187,A gun killed Van Gogh. +188,"""A new hope blooms on the long notes of old horns.""" +189,a portrait of Abe Lincoln +190,"""I love you more than the world can contain in its lonely and ramshackle head.""" +191,a character from a ghibli movie +192,f*** it market standard rule language – distinguish np tax science research +193,a portrait of Abe Lincoln +194,a wholesome clown. Not creepy at all +195, +196,a corgi +197,Easter cat +198,a portrait of Abraham Lincoln +199,a person's face +200,A poster advertising Freudian Psychoanalytics +201,Dancing in the moonlight +202,Cat in a teacup +203,a beautiful person +204,Summer's Symphony: Counterpoint and Melody +205,Post-Modern Nouveaux Statue +206,a famous painted portrait of Lady Macbeth +207,photosynthesis +208,a photo of a purple dog +209, +210,a photo of Juliet +211,The Starry Night +212,Saturn being a good dad to his son +213,a beautiful person +214,In smoke and mould the fleshless dead +215,totemic dusk +216,a beautiful woman +217,God killed Van Gogh. +218,is this loss? but it's van gogh +219,Nostos +220,a silent palace +221,"""The hunger artist, full"" by Ryan Murdock" +222,a green doG +223,Weeping Roses +224,for sale: baby shoes; never worn +225,a dog eating a cheese burger +226,a man inside a cage +227,Contentment at the Disco +228,a photo from {my hometown} +229,The EcoCathedral +230,The OLD DATA +231,treehouse in the style of studio ghibli animation +232, +233,"""The hunger artist, full"" by Ryan Murdock" +234, +235,Everywhere is no-place +236,"A portrait: man, whose lineage is corpse." +237,Last Breath +238,A propaganda poster promoting big chungus +239,зеленая собака +240,a beautiful person +241,Memento Mori +242,A propaganda poster promoting big chungus +243,is this loss? +244,a tree with weaping branches +245,Nostos +246,Beauty here -- a photograph by Ryan Murdock +247,a tiny church inside an eyeball +248, +249,a cherry tree made of fractals +250,"joy, happiness, bliss" +251,The First Supper +252,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +253,🔴~__��'t � +254,Dancing in the moonlight +255,Mona Lisa +256,"God, it's amazing." +257,a man holding an apple in one hand +258,Some stolen Gods take up the reigns of darkness. 
+259,🔴~__��'t � +260,Figure 5: a corgi +261,a photo from {my hometown} +262,Anxiety: the one emotion that does not lie +263,In the temple of God +264, +265,Metaphysics +266,a beautiful woman +267,a beautiful woman +268,a surrealist eye +269,the massive hope nof early iterations +270,Ophelia +271,a minimalist painting that you wouldn't understand +272,Aflame +273,a christmas card from the victorian era +274,Dancing in the moonlight +275,/ +276,"Mephisto, shrouded in smoke" +277,a beautiful woman +278,зеленая собака +279,Easter cat +280,The Oracle leans forward to say: beware the ides of March +281,a portrait of a beautiful person +282,Persephone +283,a portrait of Abraham Lincoln +284,the moon is a sickle cell +285,symmetry +286,Monet Lisa +287,Saturn being a good dad to his son +288,The Monet Lisa +289,I sold my soul at the crossroads +290,a beautiful person +291,A poster advertising Freudian Psychoanalytics +292,Cat in a teacup +293,a silent palace +294, +295,a beautiful person +296, +297, +298,Super Mario World but every character is Luigi +299,chiaroscuro +300,A dead man +301,pasta ömetabolism +302,A vanitas still life that features twitter follower counts +303,slightly mild cosplaying pseudo beard +304,Monet Lisa +305,Mona Lisa +306,handsome commemorative garden pigeon +307,pasta ömetabolism +308,"""The hunger artist, full"" by Ryan Murdock" +309,a gorgeous bouquet with roses and sunflowers +310,is this loss? but it's van gogh +311,Memorial +312,a forest filled with moonlight +313,Post-Modern Nouveaux Statue +314,she sings opera +315,"God closes a door, boards up stained-glass windows." +316,a dog wearing a suit playing tennis +317,Intimations of Immortality +318, +319,turnt brony undergrad dwight +320,a famous painted portrait of Lady Macbeth +321,a cherry tree made of fractals +322,Weeping Roses +323,pasta ömetabolism +324, +325, +326,"A portrait: man, whose lineage is corpse." +327,The average Advadnoun twitter follower +328,the moon is a sickle cell +329,A black and white photo of a rainbow. +330,God killed Van Gogh. +331,turnt brony undergrad dwight +332,"a brilliant sketch titled ""Let Forever be Delayed""" +333,handsome commemorative garden pigeon +334,a painting of a sycamore in +335,a professional photo of a cat wearing a party hat +336,Persephone +337,Taylor Swift +338,Homer Simpson +339,using generated paint +340,A black and white photo of a rainbow. +341,meaningless neko ♡♡ neko +342,is this loss? but it's van gogh +343,Is this loss? +344,a man from an anime +345,the massive hope nof early iterations +346,a beautiful woman +347,Post-Modern Nouveaux Statue +348,photosynthesis +349,a cherry tree made of fractals +350,a minimalist painting that you wouldn't understand +351,a corgi +352,handsome commemorative garden pigeon +353,The OLD DATA +354,cowboy with a trumpet +355,A short life full of immense joy +356,a beautiful woman +357,The end of nothing is eroding. A watercolor by K. +358,a tasteful nude +359,symmetry +360,a portrait of Abraham Lincoln +361,Last Breath +362,the eternal dread of lemongrab +363,vangogh # landscape +364,a cherry tree made of fractals +365,The Devil Whispers blood +366,a silent palace +367,Paradise Lost +368,Monet Lisa +369,Everywhere is no-place +370,Taylor Swift +371,"r.j. Murdock's ""The Death of a Hacker""" +372,a portrait of Abraham Lincoln +373,I know the end +374,Persephone +375,A poster advertising Freudian Psychoanalytics +376,a beautiful woman +377,A black and white photo of a rainbow. 
+378,the whitest man +379,the eternal dread of lemongrab +380,a drawing by an AI +381,🔴~__��'t � +382,We haunt the synapses +383,frogs in the style of Ralph Steadman +384,a beautiful haunting +385,photosynthesis +386,a character from a ghibli movie +387,A structure made of people standing on top of other people +388,Intimations of Immortality +389,a jukebox powered by smoke +390,beautiful art +391,In the temple of God +392,Intimations of Immortality +393,a beautiful painting +394,A gun killed Van Gogh. +395,a man with no eyes +396,a famous painted portrait of Lady Macbeth +397,a tasteful nude +398,a jukebox powered by smoke +399,a portrait of Juliet +400,The Patron Saint of Evil +401,a beautiful Waluigi +402,a gilded lily +403, +404,Kierkegaard on the edge +405,a beautiful person +406,Just west of Alpha Centauri +407,a horse with four eyes. +408,Good grief +409,a portrait of a beautiful person +410,Aflame +411,a man wearing makeup +412,a portrait of Abraham Lincoln +413,a corgi +414,I do not think they'll sing for me +415,Intimations of Immortality +416,A poster serving as a memento mori +417,Psychology +418,A gun killed Van Gogh. +419,"a brilliant sketch titled ""Let Forever be Delayed""" +420,using generated paint +421,pasta ömetabolism +422,a summer day +423,a gilded lily +424,a cute cat +425,on the edge of grace +426,Art is growing. +427,Spiderman delivering a pizza +428,the intersection of art and technology +429,"""The hunger artist, full"" by Ryan Murdock" +430,a tarot card +431,an omen +432,slightly mild cosplaying pseudo beard +433,meaningless neko ♡♡ neko +434,intricate nothing +435,symmetry +436,I have no idea what anything in this image is +437,a photo from {my hometown} +438,a sad man +439,face like an M.C. Escher drawing n(you could get lost in its beauty) +440,A E S T H E T I C ? +441,totemic dusk +442,Nostos +443,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +444,a silent palace +445,a beautiful painting +446,"half Ryan, half pigeon" +447,Weeping Roses +448,a broken heart +449,a portrait of Juliet +450,a painting of the last day +451,"a brilliant sketch titled ""Let Forever be Delayed""" +452,a beautiful person +453,"""The hunger artist, full"" by Ryan Murdock" +454,a cosmic entity alien with four eyes. +455,a photo of a purple dog +456,a summoning +457,Redacted ████████ +458,a ginormous baby +459,On the edge of endless darkness +460,The Fates knit such delicate nooses for us to bind +461,Theotokos of Milk +462,A minimalistic still life of a cat sitting on a table +463,Dancing in the moonlight +464,a minimalist painting that you wouldn't understand +465,a beautiful woman +466,totemic dusk +467,"Ryan Murdock's ""God haunts the suburbs""" +468,Dancing in the moonlight +469,a beautiful woman +470,a city in Van Gogh's style +471,"""The hunger artist, full"" by Ryan Murdock" +472,a person's face +473,a portrait of +474,Dancing in the moonlight +475,a portrait of Persephone +476,a minimalist painting that you wouldn't understand +477,a portrait of Abraham Lincoln +478,Synesthesia +479,a cute corgi +480,a portrait of advadnoun +481,a green doG +482,a man with no eyes +483,a cherry tree made of fractals +484,a ginormous baby +485, +486,turnt brony undergrad dwight +487,"God, it's amazing." +488,"""The hunger artist, full"" by Ryan Murdock" +489,We haunt the synapses +490,God's Eyes are Wired Shut +491,a famous painted portrait of Lady Macbeth +492,Juliet +493,a character from a ghibli movie +494,the whitest man +495,a horse with four eyes. 
+496,a photo of a purple dog +497,a beautiful person +498,The Patron Saint of Hackers +499,Dead Codes by Ryan Murdock +500,something trite +501,beautiful art +502, +503,the monet lisa +504,a cute cat +505,👉 👈 +506,A propaganda poster promoting big chungus +507,a beautiful person +508,a portrait of advadnoun +509,a cherry tree made of fractals +510,"It's a meme, I guess" +511,a person's face +512,A baroque portrait of Hamlet +513,a city in Van Gogh's style +514,"""The hunger artist, full"" by Ryan Murdock" +515,a man with no eyes +516,a minimalist painting that you wouldn't understand +517,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +518,"joy, happiness, bliss" +519, +520,"a brilliant sketch titled ""Let Forever be Delayed""" +521,Last Breath +522,On the edge of endless darkness +523,a photo of Juliet +524,Summer's Symphony: Counterpoint and Melody +525,Persephone +526,a green doG +527,symmetry +528,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +529,The Starry Night +530,Genesis +531,bootleg edgy casual assange +532,Memento Mori +533,meaningless neko ♡♡ neko +534,totemic dusk +535,Aflame +536,"""Here lies Ryan Murdock"" -- a memorial with the date and cause of departure." +537,"""The hunger artist, full"" by Ryan Murdock" +538,f*** you +539,a tree with leaves that are amarillo sightseeing thetic +540,a painting of the last day +541,"God, it's amazing." +542,Paradise Lost +543,a gilded lily +544,Aflame +545,a portrait of +546,a painting that couldn't be sold +547,a man holding an apple in one hand +548,"A clock with gorgeous, intricate gradients on it" +549,a goblin by van gogh +550,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +551,A vanitas still life that features twitter follower counts +552,the whitest man +553,"""The hunger artist, full"" by Ryan Murdock" +554,is this loss? but it's van gogh +555,Synesthesia +556,Aflame +557,a cherry tree made of fractals +558,A propaganda poster for daring to eat a peach. +559,A vanitas still life that features twitter follower counts +560,the moon is a sickle cell +561,The Lost Generation +562,the eternal dread of lemongrab +563,The First Supper +564,a character from a ghibli movie +565,a man on fire +566,symmetry +567,pasta ömetabolism +568,a horse with four eyes. +569,Metaphysics +570,Synesthesia +571,The Fates knit such delicate nooses for us to bind +572,Knowledge of Good and Evil +573,meaningless neko ♡♡ neko +574,A Tragedy +575, +576,a drawing by an AI +577,The Fool tarot card but it's The Lovers +578,a beautiful person +579,a silent palace +580,an omen +581,"A portrait: man, whose lineage is corpse." +582,Dancing in the moonlight +583,a gilded lily +584,turnt brony undergrad dwight +585,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +586,totemic dusk +587,Monet Lisa +588,fatal skull prose visits bend ntuscan painting underthecomprehend +589,Monet Lisa +590,Aflame +591,an intricate painting Of Eternity by Ryan Murdock +592,"Intricate, Weeping Tree by Ryan Murdock" +593,Summer's Symphony: Counterpoint and Melody +594,Monet Lisa +595,Last Breath +596,is this loss? but it's van gogh +597,"half Ryan, half pigeon" +598,"God closes a door, boards up the stained-glass windows. nnGod hides." +599,Everything was beautiful and nothing hurt +600,"r.j. Murdock's ""The Death of a Hacker""" +601,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +602,meaningless neko ♡♡ neko +603,twilight +604,the sun is shining on the lake +605,a portrait of a beautiful person +606,the sun is shining on the lake +607, +608,a portrait of Abe Lincoln +609,A gun killed Van Gogh. +610,a photo from {my hometown} +611,The Fool tarot card but it's The Lovers +612,A structure made of people standing on top of other people +613,"God closes a door, boards up the stained-glass windows. nnGod hides." +614,an old man +615,a beautiful waluigi +616,is this loss? but it's van gogh +617,a man standing alone in a wheat field +618,Aflame +619,Synesthesia +620, +621,Intimations of Immortality +622,The First Supper +623,"God, it's amazing." +624,Persephone +625,"r.j. Murdock's ""The Death of a Hacker""" +626,God's Eyes are Wired Shut +627,Do you remember the mythic beast?nA last-minute cancellation at The Last Supper +628,f*** it market standard rule language – distinguish np tax science research +629,totemic dusk +630,Cat in a teacup +631,frogs in the style of Ralph Steadman +632,a beautiful person +633,The Starry Night +634,Juliet +635,turnt brony undergrad dwight +636, +637,There is something so interesting about a bleeding edge full of dust. +638,On the edge of endless darkness +639,The warrior Achilles devours slain Hector's corpse -- an ink poster by Ryan Murdock +640,turnt brony undergrad dwight +641,Intimations of Immortality +642,a portrait of Abraham Lincoln +643,a man wearing makeup +644,a sketch of the mind of god +645,a man on fire +646,a portrait of Abraham Lincoln +647, +648,The ancient Θωερτυ keyboard of brave Achilles +649,goes thu extre— dum dum dizzy grimstupiddic ious mindidioirony merely experiment . +650,"A group portrait featuring the id, ego, and superego" +651,a photo from {my hometown} +652,A structure made of people standing on top of other people +653,a famous painted portrait of Lady Macbeth +654,ogden +655,pasta ömetabolism +656,a tree with weaping branches +657,photosynthesis +658,handsome commemorative garden pigeon +659,a photo of a purple dog +660,"a brilliant sketch titled ""Let Forever be Delayed""" +661,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +662,The Death of Achilles +663,potus mormon lincoln rooster +664,A black and white photo of a rainbow. +665,a beautiful haunting +666,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +667,In the temple of God +668,a beautiful person +669,The Patron Saint of Mathematics +670,a brilliant painting titled +671,a gilded lily +672,a tiny church inside an eyeball +673,a portrait of Juliet +674,A painting that sold for a million dollars +675,the moon is a sickle cell +676,photosynthesis +677,The Theotokos is a bird +678,the whitest man +679,The Monet Lisa +680,Beauty here -- a photograph by Ryan Murdock +681,Breathe deep the fumes at Delphi +682,the sun is shining on the lake +683,photosynthesis +684,The things I'll take with me +685,a green doG +686,a beautiful person +687,The years gild our memoriesnUnfairly. +688,The Lost Generation +689,a beautiful person +690,The average Advadnoun twitter follower +691,a goblin by van gogh +692,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +693,"A professional, minimalist poster for the book The Old Man and the Sea" +694, +695,Cat in a teacup +696,a beautiful person +697,beautiful art +698,I sold my soul at the crossroads +699,face like an M.C. 
Escher drawing n(you could get lost in its beauty) +700,a gorgeous bouquet with roses and sunflowers +701,a portrait of Abraham Lincoln +702,Sisyphus +703,a cute cat +704,a portrait of +705,a minimalist painting that you wouldn't understand +706,a photo of Bernie Sanders sitting on a chair and wearing mittens +707,a woman and a crow +708,a character from a ghibli movie +709,a photo of a purple dog +710,a dog eating a cheese burger +711,Last Breath +712,a sketch of the mind of god +713,a steampunk technomancer +714,We haunt the synapses +715,using generated paint +716,a cherry tree made of fractals +717,Saturn being a good dad to his son +718,oof deeplearning corgi corgi rendering +719, +720,Dancing in the moonlight +721,A Tragedy +722,A propaganda poster promoting big chungus +723,A structure made of people standing on top of other people +724,"A cute, minmimalist valentine's day card featuring a cat" +725,a cute cat +726,The skyscraper draws blood -- a landscape +727,the monet lisa +728,a photo of a person generating a painting of a person with AI +729,"""A God Made of Wires and Dust"" by Ryan Murdock" +730,Monet Lisa +731,photosynthesis +732,Hunger art by r.j. Murdock +733,"""The hunger artist, full"" by Ryan Murdock" +734,An Arundel Tomb +735,twilight +736,"r.j. Murdock's ""The Death of a Hacker""" +737,living in a den of thieves +738,"""A new hope blooms on the long notes of old horns.""" +739,"The laptop of brave Achaean Achilles, who would not live long." +740,a minimalist painting that you wouldn't understand +741,"Intricate, Weeping Tree by Ryan Murdock" +742,The Fool +743,a summoning +744,pasta ömetabolism +745,"a brilliant sketch titled ""Let Forever be Delayed""" +746,a silent palace +747,The average Advadnoun twitter follower +748,f*** it market standard rule language – distinguish np tax science research +749,Monet Lisa +750,"a brilliant sketch titled ""Let Forever be Delayed""" +751,meaningless neko ♡♡ neko +752,"God, it's amazing." +753,Nostos +754,Shinji Ikari +755,a beautiful woman +756,The Starry Night +757,hamont parkland avenue incumbscreenshotsaturday hemisphere footage algorithm +758,a beautiful woman +759, +760,Summer always ending +761,president abe lincoln but a cat +762,🎷 +763,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +764,a cherry tree made of fractals +765,A painting that sold for one billion dollars +766,a man standing alone in a wheat field +767,symmetry +768,a broken heart +769,a silent palace +770,A vanitas still life that features twitter follower counts +771,"half Ryan, half pigeon" +772,"a brilliant sketch titled ""Let Forever be Delayed""" +773,slightly mild cosplaying pseudo beard +774,a portrait of +775,God's Eyes are Wired Shut +776,she sings opera +777,a person's face +778,a cherry tree made of fractals +779,Dead Codes by Ryan Murdock +780,Everywhere is no-place +781,The First Supper +782,Monet Lisa +783,A short life full of immense joy +784,Anxiety: the one emotion that does not lie +785,Anxiety: the one emotion that does not lie +786,symmetry +787,a beautiful waluigi +788,a goblin by van gogh +789,"""A new hope blooms on the long notes of old horns.""" +790,Juliet +791,The OLD DATA +792,a beautiful woman +793,The average Advadnoun twitter follower +794,Synesthesia by Ryan Murdock +795,Persephone flees Hades +796,Last Breath +797,a portrait of Persephone +798,Homer Simpson +799,totemic dusk +800,a steampunk technomancer +801,a portrait of Abraham Lincoln +802,a cherry tree made of fractals +803,bored of dying +804,a famous painted portrait of Lady Macbeth +805,a summer day +806,A E S T H E T I C ? +807,A vanitas still life that features twitter follower counts +808,an illustration of a baby daikon radish in a tutu walking a dog +809,Persephone +810,pasta ömetabolism +811,A vision of the Theotokos in my glass of coffee +812,a dog. +813,a photo of a person generating a painting of a person with AI +814,🔴~__��'t � +815,Intimations of Immortality +816,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +817,A dead man +818,The Oracle leans forward to say: beware the ides of March +819,Monet Lisa +820,a silent palace +821,an intricate painting of eternity +822,A propaganda poster for chunky cats. +823,God killed Van Gogh. +824,the eyes of God are wired shut +825,Persephone +826,symmetry +827,Mona Lisa +828,Saturn being a good dad to his son +829,a technomancer +830, +831,a cherry tree made of fractals +832,A cat wearing a tophat +833,frogs in the style of Ralph Steadman +834,a portrait of a beautiful person +835,a green dog +836,a portrait of Abraham Lincoln +837,Hungry Dogs Will Devour in the Daytime +838,a photo of a purple dog +839,Cat in a teacup +840, +841,Nostos +842,A baroque portrait of Hamlet +843,Saturn being a good dad to his son +844,Hell is Paradise +845,a tasteful nude +846,"God, it's amazing." +847,Everywhere is no-place +848,a minimalist painting that you wouldn't understand +849,a tree with weaping branches +850,a portrait of Elvis Presley +851,a man standing alone in a wheat field +852,Juliet +853,I sold my soul at the crossroads +854,a beautiful person +855,photosynthesis +856, +857,"Mephisto, shrouded in smoke" +858,playing Go with Death +859,a painting of the last day +860,totemic dusk +861,Hell is Paradise +862,a christmas card from the victorian era +863,Good grief +864,handsome commemorative garden pigeon +865,a portrait of +866,a portrait of Abraham Lincoln +867,she came in through the wall +868,a sad man +869,In the temple of God +870,fuzzy pals hum +871,a painting of a sycamore in +872,a beautiful waluigi +873,"a brilliant sketch titled ""Let Forever be Delayed""" +874,a portrait of a beautiful person +875,a portrait of Juliet +876,MEMETIC HAZARD +877,The years gild our memoriesnUnfairly. 
+878,Mona Lisa +879,pasta ömetabolism +880,pasta ömetabolism +881,bored of dying +882,Cat in a teacup +883,a cherry tree made of fractals +884,an intricate drawing of eternity +885,mammals +886,a portrait of Persephone +887,treehouse in the style of studio ghibli animation +888,watching TV in purgatory +889,The winds of change by Ryan Murdock +890,a technomancer +891,a portrait of Persephone +892,Last Breath +893,A minimalistic still life of a cat sitting on a table +894, +895,cult of prisms +896,Aflame +897,Cat in a teacup +898,"God, it's amazing." +899,a minimalist painting that you wouldn't understand +900,a woman and a crow +901,totemic dusk +902,a city in Van Gogh's style +903,A baroque portrait of Hamlet +904,murdoch +905,a silent palace +906,Anxiety: the one emotion that does not lie +907,a photo of a purple dog +908,the moon is a sickle cell +909,Tendrils of smoke curl around the caterpillar with a hookah +910,president abe lincoln but a cat +911,a beautiful woman +912,handsome commemorative garden pigeon +913,an intricate painting of eternity +914,"God, it's amazing." +915,Grippy socks; no drawstrings: high fashion +916,The average Advadnoun twitter follower +917,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +918,a photo from {my hometown} +919,MEMETIC HAZARD +920,a portrait of Elvis Presley +921,a woman and a crow +922,Saturn being a good dad to his son +923,beautiful art +924,Shinji Ikari +925,a portrait of +926,a photo of a purple dog +927,Ophelia +928,a dog wearing a suit playing tennis +929,We haunt the synapses +930,I do not think they'll sing for me +931,Genesis +932,a beautiful person +933,"a brilliant sketch titled ""Let Forever be Delayed""" +934,Metaphysics +935,bored of dying +936,treehouse in the style of studio ghibli animation +937, +938,photosynthesis +939,A structure made of people standing on top of other people +940,meaningless neko ♡♡ neko +941,a photo of the sun melting into the ocean +942,symmetry +943,the moon is a sickle cell +944,Dancing in the moonlight +945,Last Breath +946,I sold my soul at the crossroads +947,a beautiful woman +948,"God, it's amazing." +949,Cat in a teacup +950,a tree with weaping branches +951,"God, it's amazing." +952,Cat in a teacup +953,"r.j. Murdock's ""The Death of a Hacker""" +954,using generated paint +955,fuzzy pals hum +956,"A portrait: man, whose lineage is corpse." +957,a ginormous baby +958,a beautiful woman +959,"half Ryan, half pigeon" +960,when the wind blows +961,a beautiful woman +962,pasta ömetabolism +963,a cherry tree made of fractals +964,The Monet Lisa +965,"""The hunger artist, full"" by Ryan Murdock" +966,a portrait of advadnoun +967,The Fool tarot card but it's The Lovers +968,Persephone +969,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +970,an omen +971,the eternal dread of lemongrab +972,a man on fire +973,Aflame +974,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +975,twilight +976,hamont parkland avenue incumbscreenshotsaturday hemisphere footage algorithm +977,a silent palace +978,a selfie +979,the moon is a sickle cell +980,a portrait of Abraham Lincoln +981,a tree with weaping branches +982,a tiny church inside an eyeball +983,a portrait of a beautiful person +984,Paradise Lost +985,a horse with four eyes. +986,president abe lincoln but a cat +987,a summer day +988,Anxiety: the one emotion that does not lie +989,Saturn being a good dad to his son +990,In the temple of God +991,Redacted ████████ +992,Dr. 
Faustus and Mephisto +993,a minimalist painting that you wouldn't understand +994,a man standing alone in a wheat field +995,a seance in the basement +996,a portrait of +997,Aflame +998,the moon is a sickle cell +999,beautiful art +1000,a man on fire +1001,a tiny church inside an eyeball +1002,totemic dusk +1003,Persephone +1004,piss indiefilm +1005,a beautiful woman +1006,The EcoCathedral +1007,"joy, happiness, bliss" +1008,Intimations of Immortality +1009,the whitest man +1010,a silent palace +1011, +1012,a woman and a crow +1013,Memento Mori +1014,Visions in envy of the gods +1015,symmetry +1016,A poster advertising Freudian Psychoanalytics +1017,A propaganda poster promoting big chungus +1018,With the Gods in envy of their visions +1019,a cherry tree made of fractals +1020,pasta ömetabolism +1021,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1022,a beautiful person +1023,cowboy with a trumpet +1024,a portrait of a beautiful person +1025,The OLD DATA +1026,f*** it market standard rule language – distinguish np tax science research +1027,murdoch +1028,Some stolen Gods take up the reigns of darkness. +1029,a portrait of Juliet +1030,a tasteful nude +1031,she sings opera +1032,The First Supper +1033,handsome commemorative garden pigeon +1034,cult of prisms +1035,Cat in a teacup +1036,💨 👻 ☺ 🔮 🔺 ✊ +1037,a portrait of Abraham Lincoln +1038,a corgi +1039,a beautiful woman +1040,a portrait of a beautiful person +1041,Dead Codes by Ryan Murdock +1042,totemic dusk +1043,Juliet +1044,a portrait of Elvis Presley +1045,a criminal +1046,Genesis where the universe was made +1047,a portrait of +1048,turnt brony undergrad dwight +1049,Cat in a teacup +1050,a corgi +1051,"Hamlet saying ""To be or not to be""" +1052,a portrait of a beautiful person +1053,A E S T H E T I C ? +1054,Figure 5: a corgi +1055,A gun killed Van Gogh. +1056,Persephone flees Hades +1057,a silent palace +1058,pasta ömetabolism +1059,a beautiful person +1060,on the edge of grace +1061,a portrait of Elvis Presley +1062,Persephone +1063,Tendrils of smoke curl around the caterpillar with a hookah +1064,"half Ryan, half pigeon" +1065,a sunflower +1066,a beautiful person +1067,a portrait of Juliet +1068,A dead man +1069,a character from a ghibli movie +1070,a silent palace +1071,a portrait of Elvis Presley +1072,a portrait of advadnoun +1073,A E S T H E T I C ? +1074,зеленая собака +1075,A baroque portrait of Hamlet +1076,a man at the beach +1077,Sisyphus +1078,Good grief +1079,"r.j. 
Murdock's ""The Death of a Hacker""" +1080,a beautiful woman +1081,🔴~__��'t � +1082,a portrait of advadnoun +1083,a painting of a sycamore in +1084,president abe lincoln but a cat +1085,The agony of time +1086,God once loved a woman +1087,pasta ömetabolism +1088,Dead Codes by Ryan Murdock +1089, +1090,slightly mild cosplaying pseudo beard +1091,Last Breath +1092,The Oracle leans forward to say: beware the ides of March +1093,The Devil Wears Khakis +1094,"""The hunger artist, full"" by Ryan Murdock" +1095,In the temple of God +1096,a beautiful person +1097,a man from an anime +1098,She's gorgeous +1099,A vanitas still life that features twitter follower counts +1100, +1101,the eternal dread of lemongrab +1102,Advadnoun +1103,a summer day +1104,The Fool tarot card but it's The Lovers +1105,I miss the Spring +1106,an illustration of a baby daikon radish in a tutu walking a dog +1107,The Oracle leans forward to say: beware the ides of March +1108,Contentment at the Disco +1109,The First Supper +1110,Saturn being a good dad to his son +1111,a beautiful woman +1112,"Intricate, Weeping Tree by Ryan Murdock" +1113,"a brilliant sketch titled ""Let Forever be Delayed""" +1114,beautiful art +1115, +1116,a silent palace +1117,a portrait of Juliet +1118,A propaganda poster promoting big chungus +1119,a portrait of a beautiful person +1120,a portrait of Abraham Lincoln +1121, +1122,the whitest man +1123,a portrait of Abe Lincoln +1124,Monet Lisa +1125,The Fool tarot card but it's The Lovers +1126,a portrait of +1127,a portrait of Elvis Presley +1128,Post-Modern Nouveaux Statue +1129,a cherry tree made of fractals +1130,f*** it market standard rule language – distinguish np tax science research +1131,symmetry +1132,pasta ömetabolism +1133,a brilliant painting titled +1134,The First Supper +1135,a corgi +1136,a beautiful person +1137,a green doG +1138,The OLD DATA +1139,Ophelia +1140,a portrait of Abraham Lincoln +1141,incineratures motherhood +1142,a green dog +1143,a portrait of advadnoun +1144,a sunflower +1145, +1146,a man from an anime +1147,Beauty here -- a photograph by Ryan Murdock +1148,slightly mild cosplaying pseudo beard +1149,Nostos +1150,pasta ömetabolism +1151,a beautiful person +1152,"half Ryan, half pigeon" +1153,turnt brony undergrad dwight +1154,beautiful art +1155,a portrait of Persephone +1156,A sticky-note magnum opus featuring birds +1157,I sold my soul at the crossroads +1158,"a brilliant sketch titled ""Let Forever be Delayed""" +1159,A poster advertising Freudian Psychoanalytics +1160,using generated paint +1161,The OLD DATA +1162,a horse with four eyes. +1163,is this loss? but it's van gogh +1164,a gorgeous bouquet with roses and sunflowers +1165,Anxiety: the one emotion that does not lie +1166,turnt brony undergrad dwight +1167,The Lost Generation +1168,Taylor Swift +1169,The Lost Generation +1170,a photo from {my hometown} +1171,The OLD DATA +1172,a portrait of +1173,a cherry tree made of fractals +1174,an intricate sculpture of Death itself +1175, +1176,зеленая собака +1177,a sunflower +1178,angst +1179,president abe lincoln but a cat +1180,a beautiful person +1181,The OLD DATA +1182,"You shake the demons hand, and redo it all, again." +1183,the latent space +1184,Fire +1185,a tree with weaping branches +1186,treehouse in the style of studio ghibli animation +1187,Good grief +1188,a portrait of +1189,a wholesome clown. Not creepy at all +1190,Theotokos of Milk +1191,"God closes a door, boards up the stained-glass windows. nnGod hides." 
+1192,I sold my soul at the crossroads +1193,"Mephisto, shrouded in smoke" +1194,A baroque portrait of Hamlet +1195,a lamp +1196,MEMETIC HAZARD +1197,"""Your mind falls in the gaps"" - by Ryan Murdock" +1198,cowboy with a trumpet +1199,Aflame +1200,A vanitas still life that features twitter follower counts +1201,a beautiful person +1202,Synesthesia +1203,Is this loss? +1204,Adverb working on Photoshop Neural Filters | Behance Art +1205,Everything was beautiful and nothing hurt +1206,Mona Lisa +1207,A structure made of people standing on top of other people +1208,"Intricate, Weeping Tree by Ryan Murdock" +1209,the whitest man +1210,The Fates knit such delicate nooses for us to bind +1211,a tree with weaping branches +1212,a beautiful person +1213,Nostos +1214,Post-Modern Nouveaux Statue +1215,Genesis +1216,totemic dusk +1217,a dog. +1218,photosynthesis +1219,The average Advadnoun twitter follower +1220,"""The hunger artist, full"" by Ryan Murdock" +1221,a person's face +1222,slightly mild cosplaying pseudo beard +1223,a jukebox powered by smoke +1224,Monet Lisa +1225,Intimations of Immortality +1226,a gorgeous bouquet with roses and sunflowers +1227,face like an M.C. Escher drawing n(you could get lost in its beauty) +1228,a photo of a purple dog +1229,a tiny church inside an eyeball +1230,Good grief +1231,Last Breath +1232,a beautiful waluigi +1233,the moon is a sickle cell +1234,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1235,I sold my soul at the crossroads +1236,Persephone +1237,a portrait of Abraham Lincoln +1238,a beautiful painting +1239,Last Breath +1240,a man on fire +1241,"a brilliant sketch titled ""Let Forever be Delayed""" +1242,A gun killed Van Gogh. +1243,a sketch of the mind of god +1244,Intimations of Immortality +1245,Intimations of Immortality +1246,turnt brony undergrad dwight +1247,A sticky-note magnum opus featuring birds +1248,Aflame +1249,Grippy socks; no drawstrings: high fashion +1250,👉 👈 +1251,Shrek the ogre +1252,a beautiful woman +1253,a portrait of Elvis Presley +1254,president abe lincoln but a cat +1255,Post-antiquity art +1256,using generated paint +1257,a dog eating a cheese burger +1258,The average Advadnoun twitter follower +1259,Monet Lisa +1260,"A professional, minimalist poster for the book The Old Man and the Sea" +1261,We haunt the synapses +1262,Post-Modern Nouveaux Statue +1263,a picture of Ryan Murdock +1264,cowboy with a trumpet +1265,colorful rabbits chandelier polaroid +1266,a character from a ghibli movie +1267,a goblin by van gogh +1268,a beautiful painting +1269,a photo of a purple dog +1270,a portrait of Persephone +1271,"Hamlet saying ""To be or not to be""" +1272,Homer Simpson +1273,a cute cat +1274,turnt brony undergrad dwight +1275,Intimations of Immortality +1276,a man wearing makeup +1277,They called you the hyacinth girl +1278,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1279,Cat in a teacup +1280,Juliet +1281,"""The wages of sin are generous"" by Ryan Murdock" +1282,"Pig, neither dead nor alive, stare into the heart of light, the silence." +1283, +1284,a horse with four eyes. +1285,Advadnoun +1286,Last Breath +1287,totemic dusk +1288,The OLD DATA +1289,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +1290,a man holding an apple in one hand +1291,a beautiful woman +1292,melancholia +1293,Shinji Ikari +1294,a gorgeous bouquet with roses and sunflowers +1295,a portrait of advadnoun +1296,a tasteful nude +1297,Genesis +1298,In smoke and mould the fleshless dead +1299,The average Advadnoun twitter follower +1300,a cute cat +1301,a painting of a sycamore in +1302,a woman and a crow +1303,Persephone +1304, +1305,using generated paint +1306,"A cute, minmimalist valentine's day card featuring a cat" +1307,a painting that couldn't be sold +1308,bored of dying +1309,pasta ömetabolism +1310,Dancing in the moonlight +1311,a beautiful woman +1312,Dr. Faustus and Mephisto +1313,"joy, happiness, bliss" +1314,a photo from {my hometown} +1315,a wholesome clown. Not creepy at all +1316,a portrait of Elvis Presley +1317,a cherry tree made of fractals +1318,a man standing alone in a wheat field +1319,Dancing in the moonlight +1320,Hunger art by Ryan Murdock +1321,a beautiful waluigi +1322,A black and white photo of a rainbow. +1323,totemic dusk +1324,a beautiful person +1325, +1326,a beautiful woman +1327,a horse with four eyes. +1328,The Lost Generation +1329,Death is a black camel that kneels down so we can ride +1330,a ginormous baby +1331,Dancing in the moonlight +1332,an old man +1333,a horse with four eyes. +1334,a photo of a purple dog +1335,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1336,a silent palace +1337,The OLD DATA +1338,a tree with weaping branches +1339,Creativity is only composition in disguise. +1340,"r.j. Murdock's ""The Death of a Hacker""" +1341,Persephone +1342,president abe lincoln but a cat +1343,There is something so interesting about a bleeding edge full of dust. +1344,A poster advertising death by water +1345,Persephone +1346,Saturn being a good dad to his son +1347,is this loss? but it's van gogh +1348,Monet Lisa +1349,fuzzy pals hum +1350,"""The hunger artist, full"" by Ryan Murdock" +1351,Shinji Ikari +1352,a beautiful woman +1353,"Son of man,nYou cannot say, or guess, for you know onlynA heap of broken images" +1354,God once loved a woman +1355,a horse with four eyes. +1356,a cherry tree made of fractals +1357,a beautiful haunting +1358,I miss the Spring +1359,gradient +1360,a wormhole +1361,a beautiful woman +1362,president abe lincoln but a cat +1363,handsome commemorative garden pigeon +1364,Everywhere is no-place +1365,"""It is beginning to end.""nby Ryan Murdock." +1366,she sings opera +1367,a jukebox powered by smoke +1368,a portrait of Juliet +1369,playing Go with Death +1370,a man standing alone in a wheat field +1371,Dead Codes by Ryan Murdock +1372,Synesthesia +1373,The years gild our memoriesnUnfairly. +1374,A propaganda poster promoting big chungus +1375,"God, it's amazing." 
+1376,Persephone +1377,a beautiful person +1378,MEMETIC HAZARD +1379,totemic dusk +1380,Intimations of Immortality +1381,A poster advertising death by water +1382,a photo of a purple dog +1383,symmetry +1384,A poster advertising misery +1385,a portrait of Elvis Presley +1386,Post-Modern Nouveaux Statue +1387,a man from an anime +1388,Anxiety: the one emotion that does not lie +1389,photosynthesis +1390,the man in the mirror +1391,"half Ryan, half pigeon" +1392,Sorrow's my body on the wavesnnAlone on the water +1393,a seance in the basement +1394,A poster serving as a memento mori +1395,Aflame +1396,A structure made of people standing on top of other people +1397,The First Supper +1398,totemic dusk +1399,a beautiful person +1400,a painting of the last day +1401,a photo of Juliet +1402,a horse with four eyes +1403,pasta ömetabolism +1404,Synesthesia +1405,a cherry tree made of fractals +1406,Post-post-post-post-modern art +1407,pasta ömetabolism +1408,MEMETIC HAZARD +1409,a portrait of Abe Lincoln +1410,Everywhere is no-place +1411,Memento Mori +1412,The average Advadnoun twitter follower +1413,a beautiful painting +1414,A black and white photo of a rainbow. +1415,The Death of Achilles +1416,a portrait of +1417,cult of prisms +1418,a beautiful person +1419,a beautiful painting +1420,a beautiful woman +1421,An Arundel Tomb +1422,she came in through the wall +1423,the moon is a sickle cell +1424,a minimalist painting that you wouldn't understand +1425,a tasteful nude +1426,a gilded lily +1427,a beautiful woman +1428,a brilliant painting titled +1429,a painting of the city +1430,"""Your mind falls in the gaps"" - by Ryan Murdock" +1431,"r.j. Murdock's ""The Death of a Hacker""" +1432,Aflame +1433,a beautiful painting +1434,Juliet +1435,turnt brony undergrad dwight +1436,symmetry +1437,Going home -- melanchonostalgic photography +1438,a character from a ghibli movie +1439,She's gorgeous +1440,incineratures motherhood +1441,a calm still life in ethereal blue +1442,incineratures motherhood +1443,A baroque portrait of Hamlet +1444,"A professional, minimalist poster for the book The Old Man and the Sea" +1445,Anxiety: the one emotion that does not lie +1446,a portrait of a beautiful person +1447,"Go off to sleep in the sunshine, I don’t want to see the day when it’s dying" +1448,a tree with weaping branches +1449,a tasteful nude +1450,Intimations of Immortality +1451,Weeping Roses +1452,playing Go with Death +1453,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1454,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1455,turnt brony undergrad dwight +1456,Dancing in the moonlight +1457,Figure 5: a corgi +1458,a beautiful woman +1459,A Tragedy +1460,a photo of a purple dog +1461,a famous painted portrait of Lady Macbeth +1462,"A cute, minmimalist valentine's day card featuring a cat" +1463,The things I'll take with me +1464,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1465,Summer's Symphony: Counterpoint and Melody +1466,a horse with four eyes +1467,Aflame +1468,a ginormous baby +1469, +1470,Saturn being a good dad to his son +1471,a beautiful woman +1472,a terrifying night hag +1473,a portrait of Abraham Lincoln +1474,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! 
The agony!""" +1475,a cute cat +1476,"""The hunger artist, full"" by Ryan Murdock" +1477,A baroque portrait of Hamlet +1478,a beautiful person +1479,Last Breath +1480,Juliet +1481,"Go off to sleep in the sunshine, I don’t want to see the day when it’s dying" +1482,"God, it's amazing." +1483,a portrait of Abraham Lincoln +1484,a woman and a crow +1485,a portrait of Abraham Lincoln +1486,Dancing in the moonlight +1487,a tree with weaping branches +1488,using generated paint +1489,a gilded lily +1490,treehouse in the style of studio ghibli animation +1491,chiaroscuro +1492,Last Breath +1493,A dead man +1494,a summer day +1495,The fates knit such intricate nooses for us to bind. +1496,bored of dying +1497,🔴~__��'t � +1498,Pig which could not cease to die. +1499,Intimations of Immortality +1500,a painting of a sycamore in +1501,The Fool +1502,she isn't busy: she just isn't into you +1503,a beautiful person +1504,"""The hunger artist, full"" by Ryan Murdock" +1505, +1506,a portrait of Elvis Presley +1507,a woman and a crow +1508,Homer Simpson +1509,Anxiety: the one emotion that does not lie +1510,A structure made of people standing on top of other people +1511,a beautiful person +1512,a beautiful person +1513,totemic dusk +1514,a christmas card from the victorian era +1515,Sickness of the Soul +1516,God is in heaven and all is right in the world +1517,Mona Lisa +1518,a portrait of Abraham Lincoln +1519,a cute cat +1520,turnt brony undergrad dwight +1521,"a brilliant sketch titled ""Let Forever be Delayed""" +1522,a city in Van Gogh's style +1523,Synesthesia by Ryan Murdock +1524,"""A God Made of Wires and Dust"" by Ryan Murdock" +1525,a beautiful dawn +1526,a portrait of Abraham Lincoln +1527, +1528,a horse with four eyes. +1529,Last Breath +1530,slightly mild cosplaying pseudo beard +1531, +1532,A dead man +1533,cowboy with a trumpet +1534,We haunt the synapses +1535, +1536,a horse with four eyes. +1537,pasta ömetabolism +1538,A short life full of immense joy +1539,a wormhole +1540,Juliet +1541,is this loss? but it's van gogh +1542,tamine ethereal image +1543,is this loss? but it's van gogh +1544,"A clock with gorgeous, intricate gradients on it" +1545,Dancing in the moonlight +1546,a broken heart +1547,a wormhole +1548,beautiful art +1549,Genesis +1550,face like an M.C. Escher drawing n(you could get lost in its beauty) +1551,a character from a ghibli movie +1552,Cat in a teacup +1553,symmetry +1554,A black and white photo of a rainbow. +1555,A propaganda poster promoting big chungus +1556,a woman and a crow +1557,a green doG +1558,"""The hunger artist, full"" by Ryan Murdock" +1559,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1560,Last Breath +1561,The Monet Lisa +1562,all architecture +1563,The Virgin Mary as a broken-down android +1564,a terrifying night hag +1565,a green doG +1566,pasta ömetabolism +1567,The Fool tarot card but it's The Lovers +1568,Do you remember the mythic beast?nA last-minute cancellation at The Last Supper +1569,the eternal dread of lemongrab +1570,The warrior Achilles devours slain Hector's corpse -- an ink poster by Ryan Murdock +1571,Shinji Ikari +1572,The Monet Lisa +1573,a cherry tree made of fractals +1574,a portrait of Juliet +1575,She's gorgeous +1576,A black and white photo of a rainbow. +1577,They called you the hyacinth girl +1578,a portrait of +1579,photosynthesis +1580,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +1581,The Starry Night +1582,"""A new hope blooms on the long notes of old horns.""" +1583,A minimalistic still life of a cat sitting on a table +1584,a dog eating a cheese burger +1585,A structure made of people standing on top of other people +1586,Genesis +1587, +1588,"Oh the Death, not pigs forever." +1589,The Starry Night +1590,Persephone +1591,a beautiful person +1592,Sickness of the Soul +1593,turnt brony undergrad dwight +1594,a gilded lily +1595,Photograph of a glass of Blue Tea +1596,a woman and a crow +1597, +1598,a beautiful person +1599,turnt brony undergrad dwight +1600,mammals +1601,The Lost Generation +1602,a goblin by van gogh +1603,A black and white photo of a rainbow. +1604,"""Your mind flails in the gaps"" - by Ryan Murdock" +1605,"half Ryan, half pigeon" +1606,An Arundel Tomb +1607,pasta ömetabolism +1608,A dandelion blown into the universe +1609,a man at the beach +1610,Monet Lisa +1611,"r.j. Murdock's ""The Death of a Hacker""" +1612,Saturn being a good dad to his son +1613,The Starry Night +1614,a beautiful person +1615,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1616,an old man +1617,an intricate sculpture of Death itself +1618,Genesis +1619,a cherry tree made of fractals +1620,a beautiful woman +1621,a beautiful woman +1622,an illustration of a baby daikon radish in a tutu walking a dog +1623, +1624,the latent space +1625,A dead man +1626, +1627,frogs in the style of Ralph Steadman +1628,a cherry tree made of fractals +1629,fuzzy pals hum +1630,a tiny church inside an eyeball +1631,Aflame +1632,a sunflower +1633,Nostos +1634,Monet Lisa +1635,Monet Lisa +1636,a cherry tree made of fractals +1637,Cat in a teacup +1638,I miss the Spring +1639,a beautiful person +1640,Redacted ████████ +1641,"God, it's amazing." +1642,a portrait of +1643,Shrek the ogre +1644,Super Mario World but every character is Luigi +1645,God killed Van Gogh. +1646,"A cute, minmimalist valentine's day card featuring a cat" +1647,She's gorgeous +1648,a sunflower +1649,the sun is shining on the lake +1650,the intersection of art and technology +1651,a beautiful woman +1652,a beautiful painting +1653,Paradise Lost +1654,president abe lincoln but a cat +1655, +1656,"""The Penultimate Supper"" by Da Vinci" +1657,On the edge of endless darkness +1658,With the Gods in envy of their visions +1659,Dril is a cyber-philosopher. +1660,"r.j. Murdock's ""The Death of a Hacker""" +1661, +1662,a picture of Ryan Murdock +1663,A E S T H E T I C ? +1664,deepdream aka inceptionism +1665,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1666,a beautiful woman +1667,Homer Simpson +1668,Persephone +1669,the whitest man +1670,handsome commemorative garden pigeon +1671,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1672,a minimalist painting that you wouldn't understand +1673,a beautiful person +1674,Monet Lisa +1675,Monet Lisa +1676,cult of prisms +1677,"a ""This machine kills Trojans"" sticker on a Greek lyre" +1678,The agony of time +1679,turnt brony undergrad dwight +1680,the whitest man +1681,Dril is a cyber-philosopher. +1682,Alan Turing +1683,when the wind blows +1684,a portrait of Persephone +1685,deepdream aka inceptionism +1686,Dead Codes by Ryan Murdock +1687,Saturn being a good dad to his son +1688,a portrait of Abraham Lincoln +1689,The Theotokos is a bird +1690,a beautiful woman +1691,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! 
The agony!""" +1692,a corgi +1693,a green doG +1694,A E S T H E T I C ? +1695, +1696,the intersection of art and technology +1697,Dead Codes by Ryan Murdock +1698,a cute rabbit +1699,"God, it's amazing." +1700,a silent palace +1701,a wholesome clown. Not creepy at all +1702,Exquisite LonelinessnnLurid art by Ryan Murdock +1703,A structure made of people standing on top of other people +1704,Dead Codes by Ryan Murdock +1705,a gorgeous bouquet with roses and sunflowers +1706,a portrait of +1707,intricate nothing +1708,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1709,Metaphysics +1710,using generated paint +1711,a minimalist painting that you wouldn't understand +1712,she sings opera +1713,Cat in a teacup +1714,turnt brony undergrad dwight +1715,a beautiful woman +1716,"""The hunger artist, full"" by Ryan Murdock" +1717,The years gild our memoriesnUnfairly. +1718,a woman and a crow +1719,A vanitas still life that features twitter follower counts +1720,The Monet Lisa +1721,a gorgeous bouquet with roses and sunflowers +1722,Philosophy is really homesickness: the urge to be at home everywhere +1723,a green doG +1724,an omen +1725,An elegant image of nature with gorgeous swirling gradients by R.J. Murdock +1726,a cute corgi +1727,cowboy with a trumpet +1728,"The laptop of brave Achaean Achilles, who would not live long." +1729,a portrait of a beautiful woman +1730,slightly mild cosplaying pseudo beard +1731,a man standing alone in a wheat field +1732,Aflame +1733,a portrait of Persephone +1734,a woman and a crow +1735,I sold my soul at the crossroads +1736,the demise of the universe +1737,a portrait of a beautiful person +1738,"Mephisto, shrouded in smoke" +1739,a portrait of advadnoun +1740,God is in heaven and all is right in the world +1741,a cherry tree made of fractals +1742,Odysseus speaks to the shades in Hades +1743,a steampunk technomancer +1744,a woman and a crow +1745,treehouse in the style of studio ghibli animation +1746,a gorgeous bouquet with roses and sunflowers +1747,🎷 +1748,a cherry tree made of fractals +1749,"A cute, minmimalist valentine's day card featuring a cat" +1750,a famous painted portrait of Lady Macbeth +1751,pasta ömetabolism +1752,A short life full of immense joy +1753,a terrifying night hag +1754,a horse with four eyes. +1755,A baroque portrait of Hamlet +1756,this person is +1757,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1758,"a brilliant sketch titled ""Let Forever be Delayed""" +1759,baby metal +1760,a character from a ghibli movie +1761,a corgi +1762,the massive hope nof early iterations +1763,a portrait of a beautiful person +1764,Intimations of Immortality +1765,a silent palace +1766,Post-post-post-post-modern art +1767,a person's face +1768,"r.j. Murdock's ""The Death of a Hacker""" +1769,a cherry tree made of fractals +1770,Ophelia +1771,A E S T H E T I C ? 
+1772, +1773, +1774,Genesis +1775,Persephone +1776,Last Breath +1777,a portrait of Abraham Lincoln +1778,The OLD DATA +1779,the whitest man +1780,a minimalist painting that you wouldn't understand +1781,God once loved a woman +1782,totemic dusk +1783,when the wind blows +1784,treehouse in the style of studio ghibli animation +1785,a corgi +1786,Last Breath +1787,slightly mild cosplaying pseudo beard +1788,a portrait of a beautiful woman +1789, +1790,a photo from {my hometown} +1791,Dancing in the moonlight +1792,Everywhere is no-place +1793,Post-post-post-post-modern art +1794,👉 👈 +1795, +1796,a woman and a crow +1797,"half Ryan, half pigeon" +1798,president abe lincoln but a cat +1799,A propaganda poster promoting big chungus +1800,"""The hunger artist, full"" by Ryan Murdock" +1801,a painting that couldn't be sold +1802,a beautiful haunting +1803,a technomancer +1804,"""A God Made of Wires and Dust"" by Ryan Murdock" +1805,little birds +1806,"""The hunger artist, full"" by Ryan Murdock" +1807,"""The hunger artist, full"" by Ryan Murdock" +1808,rooted reflected worries +1809,is this loss? but it's van gogh +1810,a portrait of +1811,a beautiful person +1812,a photo portrait of Joe Bidenthulu +1813,a dog eating a cheese burger +1814,Aflame +1815,"a brilliant sketch titled ""Let Forever be Delayed""" +1816,Aflame +1817,Aflame +1818,a beautiful haunting +1819,totemic dusk +1820,"""The hunger artist, full"" by Ryan Murdock" +1821,Intimations of Immortality +1822,"""Your mind fails in the gaps"" - by Ryan Murdock" +1823,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1824,a dog. +1825,a green doG +1826,The Lost Generation +1827,Last Breath +1828,intricate nothing +1829,"God, it's amazing." +1830,this person is +1831,a silent palace +1832,a dog eating a cheese burger +1833,Genesis +1834,a calm still life in ethereal blue +1835,slightly mild cosplaying pseudo beard +1836,A propaganda poster promoting big chungus +1837,is this loss? but it's van gogh +1838,Dancing in the moonlight +1839,a corgi +1840,🔴~__��'t � +1841,totemic dusk +1842,a ginormous baby +1843,Dancing in the moonlight +1844,a photo from {my hometown} +1845,a beautiful Waluigi +1846,human +1847,A black and white photo of a rainbow. +1848,a beautiful person +1849,"""Cameras can't make art""nnAn oil on canvas by Murdock" +1850,a cherry tree made of fractals +1851,a beautiful person +1852,Taylor Swift +1853,a man on fire +1854,Post-Modern Nouveaux Statue +1855,is this loss? but it's van gogh +1856,a man at the beach +1857,a beautiful person +1858,"""The hunger artist, full"" by Ryan Murdock" +1859,The OLD DATA +1860,Dancing in the moonlight +1861,A structure made of people standing on top of other people +1862,a horse with four eyes. +1863,�>: ican read wii +1864,a portrait of Abraham Lincoln +1865,A propaganda poster for chunky cats. +1866, +1867,The Death of Achilles +1868,on the edge of grace +1869,I did not mean it I wanted a cute clever cartoon I swear. +1870,a handwritten obituary +1871,a man standing alone in a wheat field +1872,the intersection of art and technology +1873,Memento Mori +1874,a portrait of a beautiful woman +1875,cigar sammycorgi +1876,a steampunk technomancer +1877,"Sons are like birds, flying always over the mountain" +1878,The Lost Generation +1879,a minimalist painting that you wouldn't understand +1880,A black and white photo of a rainbow. 
+1881,a man holding an apple in one hand +1882,🔴~__��'t � +1883,🍰 🇺 🎓 🐶 +1884,a man holding an apple in one hand +1885,a sketch of the mind of god +1886,treehouse in the style of studio ghibli animation +1887,Beauty here -- a photograph by Ryan Murdock +1888,A E S T H E T I C ? +1889,a selfie +1890,is this loss? but it's van gogh +1891,Costco wedding +1892,a beautiful person +1893,a green doG +1894,symmetry +1895,a dog eating a cheese burger +1896,a summer day +1897,"""A God Made of Wires and Dust"" by Ryan Murdock" +1898,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1899,a portrait of a beautiful woman +1900,зеленая собака +1901,"joy, happiness, bliss" +1902,Juliet +1903,a wholesome clown. Not creepy at all +1904,meaningless neko ♡♡ neko +1905,I can read when there's writing on the wall +1906,"Oh the Death, not pigs forever." +1907,a minimalist painting that you wouldn't understand +1908,Aflame +1909,Super Mario World but every character is Luigi +1910,/ +1911,Dead Codes by Ryan Murdock +1912,A vanitas still life that features twitter follower counts +1913,a beautiful woman +1914,a lamp +1915, +1916,the eyes of God are wired shut +1917,intricate nothing +1918,Is this loss? +1919,a photo of a purple dog +1920,a lamp +1921,totemic dusk +1922,The average Advadnoun twitter follower +1923,photosynthesis +1924,Costco wedding +1925,🔴~__��'t � +1926,Aflame +1927,a cherry tree made of fractals +1928,an intricate painting of eternity +1929,Saturn being a good dad to his son +1930,Nostos +1931,a beautiful person +1932,A gargoyle of wires and flesh +1933,🎷 +1934,a beautiful person +1935,a tasteful nude +1936,Faceless Sorrow +1937,a gorgeous bouquet with roses and sunflowers +1938,using generated paint +1939,A Tragedy +1940,зеленая собака +1941,🔴~__��'t � +1942,A Tragedy +1943,A sticky-note magnum opus featuring birds +1944,president abe lincoln but a cat +1945,using generated paint +1946, +1947,Intimations of Immortality +1948,a portrait of +1949,a silent palace +1950,A poster advertising death by water +1951,A propaganda poster promoting big chungus +1952,totemic dusk +1953,a horse with four eyes. +1954,cigar sammycorgi +1955,"""It is beginning to end.""nby Ryan Murdock." +1956,all architecture +1957,a portrait of Abraham Lincoln +1958,"joy, happiness, bliss" +1959,a man with a beard +1960,Genesis +1961,👉 👈 +1962,Summer's Symphony: Counterpoint and Melody +1963,A gun killed Van Gogh. +1964,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1965,A minimalist propaganda poster promoting panpsychism +1966,Persephone +1967,a goblin by van gogh +1968,"""A new hope blooms on the long notes of old horns.""" +1969,a painting of the city +1970, +1971,The agony of time +1972,Ophelia +1973,turnt brony undergrad dwight +1974,a beautiful person +1975,totemic dusk +1976,The Fool tarot card but it's The Lovers +1977, +1978,a broken heart +1979,"Rise, Oink, Lazarus of Bethany" +1980,"""The hunger artist, full"" by Ryan Murdock" +1981,a cherry tree made of fractals +1982,an intricate painting of eternity +1983,She's gorgeous +1984,a beautiful person +1985,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +1986,using generated paint +1987,a portrait of Abe Lincoln +1988,Persephone flees Hades +1989,a steampunk technomancer +1990,a beautiful woman +1991,"A portrait: man, whose lineage is corpse." +1992,🔴~__��'t � +1993,Intimations of Immortality +1994,an omen +1995,Persephone +1996,"God closes a door, boards up stained-glass windows." 
+1997,"""A new hope blooms on the long notes of old horns.""" +1998,Fire +1999, +2000,Metaphysics +2001,"""The hunger artist, full"" by Ryan Murdock" +2002,when the wind blows +2003,a portrait of a beautiful person +2004,The Lost Generation +2005,a corgi +2006,a beautiful woman +2007,pasta ömetabolism +2008,a sad man +2009,Juliet +2010,a painting of a sycamore in +2011,a portrait of Abraham Lincoln +2012,The Fates knit such delicate nooses for us to bind +2013,a photo from {my hometown} +2014,a tree with leaves that are amarillo sightseeing thetic +2015,Sickness of the Soul +2016,pasta ömetabolism +2017,pasta ömetabolism +2018,bored of dying +2019,An Arundel Tomb +2020,The Starry Night +2021,Nostos +2022,bored of dying +2023,The Lost Generation +2024,The average Advadnoun twitter follower +2025,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +2026,a silent palace +2027,beautiful art +2028, +2029,Last Breath +2030, +2031,a tasteful nude +2032,a portrait of advadnoun +2033,a portrait of a beautiful person +2034,a man holding an apple in one hand +2035,a gorgeous bouquet with roses and sunflowers +2036,photosynthesis +2037,God killed Van Gogh. +2038,Saturn being a good dad to his son +2039,a horse with four eyes. +2040,a beautiful woman +2041,a beautiful person +2042,a portrait of Abe Lincoln +2043,totemic dusk +2044,A Tragedy +2045,Persephone +2046,The OLD DATA +2047,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +2048,face like an M.C. Escher drawing n(you could get lost in its beauty) +2049,Dead Codes by Ryan Murdock +2050,Intimations of Immortality +2051,turnt brony undergrad dwight +2052,a photo of a purple dog +2053,Cat in a teacup +2054,🔴~__��'t � +2055,turnt brony undergrad dwight +2056,Beauty here -- a photo by r.j. Murdock +2057,The Fool +2058,a portrait of Juliet +2059,a jukebox powered by smoke +2060,cowboy with a trumpet +2061,twilight +2062,"joy, happiness, bliss" +2063,Dead Codes by Ryan Murdock +2064,"a brilliant sketch titled ""Let Forever be Delayed""" +2065,tamine ethereal image +2066,a portrait of +2067,"God, it's amazing." +2068,she came in through the wall +2069,Fire +2070,Juliet +2071,God killed Van Gogh. +2072,a portrait of Persephone +2073,a beautiful person +2074,the whitest man +2075,Somewhere where I am not.nIntricate beauty by Ryan Murdock. +2076,a gilded lily +2077,The Lost Generation +2078,Dead Codes by Ryan Murdock +2079,Intimations of Immortality +2080,meaningless neko ♡♡ neko +2081,beautiful art +2082,"""The hunger artist, full"" by Ryan Murdock" +2083,an intricate painting of eternity +2084,Good grief +2085,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +2086,The Fool diff --git a/Optimus/data/datasets/README.md b/Optimus/data/datasets/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8c4682b9c9a14661a479db8be763884829deca03 --- /dev/null +++ b/Optimus/data/datasets/README.md @@ -0,0 +1,28 @@ +# Pre-processing dataset + +## Pre-trained dataset: Wikipedia + +The dataset can be downloaded here. We split the original wiki text into 298 files, and loop over files in one epoch. + +We filter each sentence in wiki based on two constraints: (1) The sentence length is smaller than 64. (2) The tokenized sentence length is smaller than 256 (so that the encoder can take the entire sentence). 
+ +To filter the sentence, please change the data folders and run the script: + + sh scripts/scripts_local/run_data_filtering_wiki.sh + +The filtered files are saved in "data/datasets/wikipedia_json_64_filtered". + + +## Fine-tuning datasets + +Language Modeling: Penn, Yelp, Yahoo, Snli + + +(Stylized) Dialog response generation: DailyDialog, Holmes + + +Label-conditional text generation: Yelp. + + +Language Understanding: GLUE, Yelp. + diff --git a/Optimus/data/datasets/debug_data.zip b/Optimus/data/datasets/debug_data.zip new file mode 100644 index 0000000000000000000000000000000000000000..62e20df85e880e533e876673cf4bf2ef863dfc25 --- /dev/null +++ b/Optimus/data/datasets/debug_data.zip @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b2f48abb22feed028edf0d33a5786bec4881160e987e4d134f23ca1a68b4c9d0 +size 8207 diff --git a/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_test.json b/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_test.json new file mode 100755 index 0000000000000000000000000000000000000000..d65babe15ab17ec3c87d8cd6efd1ea33066ba129 --- /dev/null +++ b/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_test.json @@ -0,0 +1 @@ +[{"bert_token": [101, 170, 1299, 1198, 133, 8362, 1377, 135, 1114, 13093, 4182, 2399, 1103, 10284, 1141, 1314, 1159, 119, 102], "bert_token_length": 19, "gpt2_token": [50258, 257, 582, 655, 1279, 2954, 29, 351, 12317, 4890, 5341, 262, 781, 1133, 530, 938, 640, 764, 50259], "gpt2_token_length": 19}, {"bert_token": [101, 1103, 1372, 1104, 1234, 1866, 1107, 1103, 1768, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 262, 1448, 286, 661, 6204, 287, 262, 2214, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1234, 6546, 1126, 7814, 3838, 119, 102], "bert_token_length": 8, "gpt2_token": [50258, 661, 11969, 281, 15162, 10010, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1685, 1299, 19476, 2624, 5753, 102], "bert_token_length": 7, "gpt2_token": [50258, 1862, 582, 31017, 12586, 50259], "gpt2_token_length": 6}, {"bert_token": [101, 170, 3676, 12957, 1194, 170, 2487, 1118, 1199, 3546, 119, 102], "bert_token_length": 12, "gpt2_token": [50258, 257, 3290, 39416, 832, 257, 4324, 416, 617, 6134, 764, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 1299, 1107, 18679, 5464, 1120, 170, 17287, 1952, 1113, 170, 21162, 1285, 119, 102], "bert_token_length": 15, "gpt2_token": [50258, 582, 287, 36251, 7722, 379, 257, 26725, 3084, 319, 257, 27737, 1110, 764, 50259], "gpt2_token_length": 15}, {"bert_token": [101, 1103, 1590, 133, 8362, 1377, 135, 133, 8362, 1377, 135, 7688, 176, 19224, 1111, 1289, 1974, 1988, 1105, 1110, 1208, 2613, 1796, 1111, 2657, 2209, 1106, 1435, 119, 102], "bert_token_length": 30, "gpt2_token": [50258, 262, 2415, 1279, 2954, 29, 1279, 2954, 29, 2208, 22749, 329, 1021, 1256, 295, 290, 318, 783, 4953, 2354, 329, 3315, 3241, 284, 1282, 764, 50259], "gpt2_token_length": 27}, {"bert_token": [101, 1160, 1535, 1132, 2807, 1107, 133, 8362, 1377, 135, 3092, 1171, 8391, 1112, 1141, 1827, 1166, 1123, 2342, 1120, 170, 15302, 1769, 2482, 119, 102], "bert_token_length": 26, "gpt2_token": [50258, 734, 1466, 389, 5586, 287, 1279, 2954, 29, 8539, 736, 18791, 355, 530, 2173, 625, 607, 8163, 379, 257, 20239, 1692, 3785, 764, 50259], "gpt2_token_length": 25}, {"bert_token": [101, 1103, 17989, 1110, 1107, 170, 4382, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 262, 46612, 318, 287, 257, 7072, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 1590, 133, 8362, 1377, 135, 
3323, 1105, 12792, 102], "bert_token_length": 10, "gpt2_token": [50258, 2415, 1279, 2954, 29, 24730, 290, 33041, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 2963, 1873, 2355, 170, 1894, 5828, 2884, 1107, 1141, 1289, 1105, 5497, 1121, 1122, 1114, 1103, 1168, 117, 1199, 1104, 1122, 1113, 1123, 1339, 119, 102], "bert_token_length": 28, "gpt2_token": [50258, 257, 5156, 2576, 4769, 257, 2266, 7309, 3091, 287, 530, 1021, 290, 6600, 422, 340, 351, 262, 584, 837, 617, 286, 340, 319, 607, 1986, 764, 50259], "gpt2_token_length": 28}, {"bert_token": [101, 1210, 6363, 1132, 1919, 1487, 1106, 1147, 3283, 112, 188, 1402, 1106, 3940, 4014, 119, 102], "bert_token_length": 17, "gpt2_token": [50258, 1115, 6844, 389, 2491, 1978, 284, 511, 4958, 705, 82, 2156, 284, 4483, 8073, 764, 50259], "gpt2_token_length": 17}, {"bert_token": [101, 170, 1299, 1110, 2033, 2407, 1106, 13477, 6628, 1126, 16355, 119, 102], "bert_token_length": 13, "gpt2_token": [50258, 257, 582, 318, 1972, 3492, 284, 11662, 7521, 281, 15422, 764, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 1103, 2854, 7081, 15775, 1110, 2504, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 262, 4771, 8566, 27763, 318, 4692, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 8781, 1590, 1110, 3179, 1106, 2283, 1123, 2236, 102], "bert_token_length": 11, "gpt2_token": [50258, 257, 32749, 2415, 318, 6155, 284, 1826, 607, 3128, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 170, 1590, 1107, 170, 20399, 2969, 17373, 1123, 1739, 1229, 2288, 1107, 170, 14206, 2984, 119, 102], "bert_token_length": 18, "gpt2_token": [50258, 257, 2415, 287, 257, 49807, 10147, 38744, 607, 5101, 981, 5055, 287, 257, 16918, 3650, 764, 50259], "gpt2_token_length": 18}, {"bert_token": [101, 170, 1299, 1105, 1117, 1676, 1132, 3179, 1113, 170, 2771, 13868, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 257, 582, 290, 465, 3656, 389, 6155, 319, 257, 3272, 11152, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 170, 1299, 1110, 1217, 13781, 1118, 170, 3676, 102], "bert_token_length": 10, "gpt2_token": [50258, 257, 582, 318, 852, 26172, 416, 257, 3290, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 170, 8295, 1197, 3351, 1126, 5925, 10815, 13863, 1107, 1103, 5282, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 257, 275, 18320, 5762, 281, 10912, 14335, 17445, 287, 262, 8701, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 1103, 1590, 4307, 170, 2221, 9606, 1105, 1307, 1103, 3323, 102], "bert_token_length": 12, "gpt2_token": [50258, 262, 2415, 12408, 257, 4171, 23967, 290, 2826, 262, 24730, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 170, 1685, 2298, 16526, 1117, 1339, 1154, 1117, 4885, 1104, 2094, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 257, 1862, 2933, 31048, 465, 1986, 656, 465, 7480, 286, 2057, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 170, 2343, 4206, 20357, 1116, 1150, 1125, 170, 3774, 11972, 1120, 1250, 119, 102], "bert_token_length": 15, "gpt2_token": [50258, 257, 15787, 41494, 508, 550, 257, 5975, 21493, 379, 670, 764, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 1175, 1132, 1185, 1441, 5260, 1796, 170, 2689, 1555, 119, 102], "bert_token_length": 12, "gpt2_token": [50258, 612, 389, 645, 1450, 9272, 2354, 257, 4158, 2139, 764, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 1160, 1234, 1684, 1113, 9305, 4883, 1121, 170, 3664, 119, 102], "bert_token_length": 12, "gpt2_token": [50258, 734, 661, 1762, 319, 10829, 6729, 422, 
257, 9753, 764, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 170, 1299, 3179, 1205, 1103, 16331, 23071, 1117, 1171, 1114, 170, 3499, 9145, 1107, 1103, 3582, 1112, 1103, 3336, 3741, 1166, 1103, 1447, 119, 102], "bert_token_length": 26, "gpt2_token": [50258, 257, 582, 6155, 866, 262, 17748, 34688, 465, 736, 351, 257, 8848, 28499, 287, 262, 4469, 355, 262, 4252, 5621, 625, 262, 1660, 764, 50259], "gpt2_token_length": 26}, {"bert_token": [101, 1234, 1132, 5578, 1106, 170, 3838, 1229, 2903, 170, 2337, 2842, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 661, 389, 8680, 284, 257, 10010, 981, 4964, 257, 3155, 9280, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 1103, 1590, 1110, 2613, 1111, 1103, 3592, 1106, 6657, 119, 102], "bert_token_length": 12, "gpt2_token": [50258, 262, 2415, 318, 4953, 329, 262, 1323, 284, 9240, 764, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 170, 1825, 1110, 3351, 170, 2221, 6131, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 257, 1048, 318, 5762, 257, 4171, 6877, 764, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 170, 3676, 2326, 1166, 3712, 4033, 1702, 1111, 1117, 3172, 102], "bert_token_length": 12, "gpt2_token": [50258, 257, 3290, 4539, 625, 5894, 4534, 2045, 329, 465, 4870, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 170, 1299, 4462, 1107, 8492, 2944, 1710, 3459, 1110, 5923, 1114, 170, 5141, 1107, 170, 3642, 119, 102], "bert_token_length": 19, "gpt2_token": [50258, 257, 582, 12049, 287, 599, 7115, 2151, 8242, 318, 15360, 351, 257, 10846, 287, 257, 6576, 764, 50259], "gpt2_token_length": 19}, {"bert_token": [101, 1103, 2067, 6767, 1200, 133, 8362, 1377, 135, 13481, 1114, 1117, 1297, 102], "bert_token_length": 14, "gpt2_token": [50258, 262, 3881, 5424, 527, 1279, 2954, 29, 32695, 351, 465, 1204, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 1685, 2298, 16708, 1107, 1353, 21291, 1113, 170, 5017, 2221, 2186, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 1862, 2933, 38193, 287, 1402, 47434, 319, 257, 9480, 4171, 7850, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 1353, 1482, 1107, 11620, 2288, 1107, 1103, 5282, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 1402, 1751, 287, 22551, 5055, 287, 262, 8701, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1103, 2027, 1110, 1136, 3351, 7537, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 262, 1200, 318, 407, 5762, 15232, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 5141, 13587, 5679, 1154, 5679, 13624, 1114, 7366, 5911, 1113, 1172, 119, 102], "bert_token_length": 15, "gpt2_token": [50258, 257, 10846, 23147, 8887, 656, 8887, 14180, 351, 15061, 3601, 319, 606, 764, 50259], "gpt2_token_length": 15}, {"bert_token": [101, 1210, 1685, 1482, 1505, 1107, 1103, 5282, 1223, 170, 2780, 119, 102], "bert_token_length": 13, "gpt2_token": [50258, 1115, 1862, 1751, 711, 287, 262, 8701, 739, 257, 5509, 764, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 1103, 2927, 11697, 1132, 1656, 119, 102], "bert_token_length": 8, "gpt2_token": [50258, 262, 2318, 1213, 389, 2641, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 170, 15387, 2274, 1103, 1730, 1170, 1515, 1126, 133, 8362, 1377, 135, 1354, 119, 102], "bert_token_length": 16, "gpt2_token": [50258, 257, 34632, 2753, 262, 1085, 706, 1719, 281, 1279, 2954, 29, 1807, 764, 50259], "gpt2_token_length": 15}, {"bert_token": [101, 170, 1372, 1104, 1234, 10482, 1796, 1104, 170, 3227, 1200, 119, 102], "bert_token_length": 13, 
"gpt2_token": [50258, 257, 1448, 286, 661, 8960, 2354, 286, 257, 12172, 525, 764, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 170, 1299, 1107, 1653, 2288, 1120, 170, 16976, 1107, 1524, 1104, 170, 4764, 1146, 1894, 20552, 119, 102], "bert_token_length": 19, "gpt2_token": [50258, 257, 582, 287, 2330, 5055, 379, 257, 21822, 287, 2166, 286, 257, 10645, 510, 2266, 26373, 764, 50259], "gpt2_token_length": 19}] \ No newline at end of file diff --git a/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_train.json b/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_train.json new file mode 100755 index 0000000000000000000000000000000000000000..093650c7be8deacf4b1971d0c5925afbe5c6872d --- /dev/null +++ b/Optimus/data/datasets/debug_data/cached_lm_gpt_bert_100_train.json @@ -0,0 +1 @@ +[{"bert_token": [101, 170, 1825, 12761, 1113, 170, 2067, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 257, 1048, 38207, 319, 257, 3881, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 1103, 11742, 1110, 3058, 119, 102], "bert_token_length": 7, "gpt2_token": [50258, 262, 2685, 78, 318, 7586, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1160, 1535, 1132, 5116, 9370, 20433, 1111, 1103, 15390, 119, 102], "bert_token_length": 12, "gpt2_token": [50258, 734, 1466, 389, 8179, 13157, 27678, 329, 262, 13222, 764, 50259], "gpt2_token_length": 12}, {"bert_token": [101, 170, 1467, 1110, 3351, 9901, 11710, 102], "bert_token_length": 8, "gpt2_token": [50258, 257, 4097, 318, 5762, 12336, 20858, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1103, 1447, 1108, 9020, 119, 102], "bert_token_length": 7, "gpt2_token": [50258, 262, 1660, 373, 14081, 764, 50259], "gpt2_token_length": 7}, {"bert_token": [101, 170, 1299, 1110, 187, 13024, 119, 102], "bert_token_length": 8, "gpt2_token": [50258, 257, 582, 318, 374, 868, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1122, 1110, 133, 8362, 1377, 135, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 340, 318, 1279, 2954, 29, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1175, 1132, 2636, 1115, 1132, 1113, 1103, 2854, 13029, 1147, 2854, 5923, 119, 102], "bert_token_length": 15, "gpt2_token": [50258, 612, 389, 4813, 326, 389, 319, 262, 4771, 18207, 511, 4771, 15360, 764, 50259], "gpt2_token_length": 15}, {"bert_token": [101, 170, 1590, 1110, 5569, 170, 182, 15680, 1113, 170, 2472, 1114, 1242, 1168, 1234, 5569, 182, 15680, 1116, 1481, 1123, 117, 1229, 5118, 1468, 1105, 12438, 1116, 7311, 1107, 1103, 2863, 9008, 119, 102], "bert_token_length": 35, "gpt2_token": [50258, 257, 2415, 318, 10311, 257, 285, 19458, 319, 257, 4675, 351, 867, 584, 661, 10311, 285, 404, 5379, 2157, 607, 837, 981, 4269, 364, 290, 31931, 8181, 287, 262, 7150, 16965, 764, 50259], "gpt2_token_length": 34}, {"bert_token": [101, 1103, 1299, 1144, 170, 4937, 1107, 1117, 1289, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 262, 582, 468, 257, 9845, 287, 465, 1021, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1103, 1441, 24945, 1380, 119, 102], "bert_token_length": 7, "gpt2_token": [50258, 262, 1450, 25432, 1223, 764, 50259], "gpt2_token_length": 7}, {"bert_token": [101, 170, 1372, 1104, 18456, 20547, 1116, 1132, 1543, 170, 1769, 15931, 1120, 170, 3163, 1342, 119, 102], "bert_token_length": 18, "gpt2_token": [50258, 257, 1448, 286, 14042, 37553, 389, 1642, 257, 1692, 27944, 379, 257, 9669, 983, 764, 50259], "gpt2_token_length": 17}, {"bert_token": [101, 1372, 1104, 1234, 1107, 1103, 15211, 16360, 8171, 1554, 1104, 
2094, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 1448, 286, 661, 287, 262, 22775, 24157, 10559, 1336, 286, 2057, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 1103, 1210, 1441, 1132, 4395, 1141, 1330, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 262, 1115, 1450, 389, 5742, 530, 1194, 764, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 1103, 1299, 1144, 1126, 6337, 1189, 1121, 3926, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 262, 582, 468, 281, 8875, 925, 422, 6953, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1234, 1107, 1103, 2243, 1104, 1331, 2472, 4405, 1118, 1415, 2275, 119, 102], "bert_token_length": 14, "gpt2_token": [50258, 661, 287, 262, 3504, 286, 1748, 4675, 11191, 416, 1588, 6832, 764, 50259], "gpt2_token_length": 14}, {"bert_token": [101, 170, 15387, 18081, 1116, 1205, 170, 4665, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 257, 34632, 42563, 866, 257, 12788, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 1299, 1110, 1702, 1166, 1614, 1115, 1132, 1107, 170, 1469, 2319, 119, 102], "bert_token_length": 15, "gpt2_token": [50258, 257, 582, 318, 2045, 625, 1243, 326, 389, 287, 257, 1957, 1910, 764, 50259], "gpt2_token_length": 15}, {"bert_token": [101, 1299, 1773, 170, 3058, 1105, 1653, 3651, 2092, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 582, 2712, 257, 7586, 290, 2330, 5186, 10047, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1234, 1280, 5947, 1107, 1103, 5969, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 661, 1016, 14899, 287, 262, 9151, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 2298, 14836, 1372, 1110, 14249, 23178, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 257, 2933, 24490, 1448, 318, 24522, 24349, 764, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 1103, 9227, 1110, 1107, 170, 12533, 1155, 1602, 11378, 102], "bert_token_length": 11, "gpt2_token": [50258, 262, 38619, 318, 287, 257, 14262, 477, 2042, 16313, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1210, 1441, 1132, 1107, 1473, 4524, 102], "bert_token_length": 8, "gpt2_token": [50258, 1115, 1450, 389, 287, 1918, 19272, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 170, 1873, 3351, 170, 1653, 2969, 1105, 2221, 5831, 8987, 1228, 170, 2067, 1154, 1103, 5387, 119, 102], "bert_token_length": 19, "gpt2_token": [50258, 257, 2576, 5762, 257, 2330, 10147, 290, 4171, 21029, 14284, 572, 257, 3881, 656, 262, 6450, 764, 50259], "gpt2_token_length": 19}, {"bert_token": [101, 1299, 4642, 1106, 19726, 1873, 1118, 10398, 1154, 1103, 1447, 119, 102], "bert_token_length": 13, "gpt2_token": [50258, 582, 8404, 284, 14947, 2576, 416, 23186, 656, 262, 1660, 764, 50259], "gpt2_token_length": 13}, {"bert_token": [101, 7589, 5575, 1164, 1106, 2303, 1228, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 8383, 11029, 546, 284, 2121, 572, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 24438, 1591, 1110, 3351, 170, 1653, 6029, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 294, 2137, 318, 5762, 257, 2330, 8187, 764, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 170, 3676, 1110, 1773, 1107, 1103, 5781, 1447, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 257, 3290, 318, 2712, 287, 262, 15191, 1660, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 1103, 2298, 1110, 5947, 11786, 1107, 1103, 4528, 102], "bert_token_length": 10, "gpt2_token": [50258, 262, 2933, 318, 14899, 18177, 287, 
262, 5933, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 170, 1709, 2154, 4518, 1117, 1981, 1113, 1141, 1104, 1117, 1591, 112, 188, 2342, 119, 102], "bert_token_length": 17, "gpt2_token": [50258, 257, 4346, 3985, 5137, 465, 3211, 319, 530, 286, 465, 2137, 705, 82, 8163, 764, 50259], "gpt2_token_length": 17}, {"bert_token": [101, 1103, 1299, 7086, 1146, 1117, 3227, 1200, 119, 102], "bert_token_length": 10, "gpt2_token": [50258, 262, 582, 9808, 510, 465, 12172, 525, 764, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 170, 1590, 10498, 170, 4261, 102], "bert_token_length": 7, "gpt2_token": [50258, 257, 2415, 17607, 257, 6614, 50259], "gpt2_token_length": 7}, {"bert_token": [101, 1103, 1300, 2636, 1508, 1149, 1103, 1783, 102], "bert_token_length": 9, "gpt2_token": [50258, 262, 1440, 4813, 1234, 503, 262, 2046, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 1103, 133, 8362, 1377, 135, 1110, 5578, 1106, 1390, 102], "bert_token_length": 11, "gpt2_token": [50258, 262, 1279, 2954, 29, 318, 8680, 284, 2647, 50259], "gpt2_token_length": 10}, {"bert_token": [101, 1103, 15050, 1132, 1120, 1103, 4640, 119, 102], "bert_token_length": 9, "gpt2_token": [50258, 262, 15508, 389, 379, 262, 10481, 764, 50259], "gpt2_token_length": 9}, {"bert_token": [101, 170, 176, 11071, 1158, 5235, 119, 102], "bert_token_length": 8, "gpt2_token": [50258, 257, 1036, 4509, 8414, 764, 50259], "gpt2_token_length": 7}, {"bert_token": [101, 170, 1590, 1110, 8739, 1114, 170, 1415, 9814, 119, 102], "bert_token_length": 11, "gpt2_token": [50258, 257, 2415, 318, 10801, 351, 257, 1588, 1787, 764, 50259], "gpt2_token_length": 11}, {"bert_token": [101, 2719, 2489, 1113, 1103, 2472, 119, 102], "bert_token_length": 8, "gpt2_token": [50258, 7912, 2356, 319, 262, 4675, 764, 50259], "gpt2_token_length": 8}, {"bert_token": [101, 1175, 1132, 1160, 1685, 3287, 1114, 15909, 3447, 1150, 1132, 1796, 1105, 1132, 10750, 1107, 1103, 6786, 119, 102], "bert_token_length": 20, "gpt2_token": [50258, 612, 389, 734, 1862, 6510, 351, 33677, 82, 508, 389, 2354, 290, 389, 18894, 287, 262, 13647, 764, 50259], "gpt2_token_length": 20}, {"bert_token": [101, 1160, 1441, 1107, 2221, 1141, 2288, 1103, 1168, 5205, 2135, 1103, 2487, 1104, 170, 14172, 1610, 1198, 1702, 1149, 119, 102], "bert_token_length": 22, "gpt2_token": [50258, 734, 1450, 287, 4171, 530, 5055, 262, 584, 10938, 4291, 262, 4324, 286, 257, 29957, 1097, 655, 2045, 503, 764, 50259], "gpt2_token_length": 22}] \ No newline at end of file diff --git a/Optimus/data/datasets/debug_data/test.txt b/Optimus/data/datasets/debug_data/test.txt new file mode 100755 index 0000000000000000000000000000000000000000..b76881615fae2808bbc78bcb7f00a841e0189269 --- /dev/null +++ b/Optimus/data/datasets/debug_data/test.txt @@ -0,0 +1,40 @@ +a man just with lung cancer plays the flute one last time . +the group of people stood in the field . +people attending an outdoor concert . +young man skipping rocks +a dog leans through a window by some plants . +man in sunglasses drinking at a cafe table on a sunny day . +the woman super glue for hand lotion and is now waiting outside for medical attention to come . +two women are sitting in wing back chairs as one points over her shoulder at a colorful human figure . +the waiter is in a restaurant . +woman drums and sings +a baby girl holding a red plastic box in one hand and eating from it with the other , some of it on her face . +three dogs are running together to their master 's house to eat dinner . +a man is getting ready to spray paint an advertisement . 
+the ice cream cone is cold . +a blond woman is walking to meet her date +a woman in a striped shirt folds her arms while standing in a grocery store . +a man and his wife are walking on a crosswalk . +a man is being chased by a dog +a biker wearing an orange helmet rides in the grass . +the woman wore a blue skirt and played the drums +a young boy presses his face into his plate of food . +a seamstress who had a surprise visitor at work . +there are no men gathered outside a religious service . +two people working on removing snow from a roof . +a man walking down the pier scratching his back with a boat sailing in the background as the sun sets over the water . +people are listening to a concert while watching a couple dance . +the woman is waiting for the bus to arrive . +a person is wearing a blue hat . +a dog runs over dry earth looking for his owner +a man dressed in spanish party clothes is dancing with a lady in a dress . +the rock climber escapes with his life +young boy drifting in small canoe on a calm blue river . +small children in uniforms standing in the grass . +the child is not wearing glasses . +a lady pouring tea into tea cups with flower print on them . +three young children play in the grass under a tree . +the barbers are inside . +a cyclist takes the lead after having an thought . +a group of people relax outside of a camper . +a man in white standing at a microphone in front of a lip up red backdrop . diff --git a/Optimus/data/datasets/debug_data/train.txt b/Optimus/data/datasets/debug_data/train.txt new file mode 100755 index 0000000000000000000000000000000000000000..a2f4215a70495bab7be3e8bcb81e2ba7763e767b --- /dev/null +++ b/Optimus/data/datasets/debug_data/train.txt @@ -0,0 +1,40 @@ +a person dances on a rock . +the cello is brown . +two women are busy collecting hay for the harvest . +a band is wearing matching shirts +the water was lovely . +a man is raking . +it is . +there are girls that are on the ice practicing their ice dancing . +a woman is riding a moped on a street with many other people riding mopeds behind her , while streamers and banners hang in the trees overhead . +the man has a knife in his hand . +the men assemble something . +a group of cheerleaders are making a human pyramid at a basketball game . +group of people in the wilderness packing boxes full of food . +the three men are helping one another . +the man has an instrument made from iron . +people in the middle of city street surrounded by large buildings . +a cyclist pedals down a hill . +a man is looking over things that are in a local market . +man playing a brown and white electric guitar . +people going swimming in the ocean . +a boy scout group is hiking outdoors . +the dancer is in a boring all black outfit +three men are in death valley +a girl wearing a white shirt and blue jeans jumping off a rock into the sand . +man tries to impress girl by diving into the water . +worker sleeping about to fall off . +th player is wearing a white uniform . +a dog is playing in the shore water . +the boy is swimming happily in the pool +a football coach putting his arm on one of his player 's shoulder . +the man opens up his camper . +a woman flies a plane +the four girls put out the fire +the is listening to music +the teens are at the beach . +a grilling contest . +a woman is cooking with a large pot . +artists pain on the street . +there are two young boys with shovels who are outside and are digging in the dirt . 
+two men in blue one standing the other hanging onto the window of a tram car just looking out . diff --git a/Optimus/data/datasets/debug_data/valid.txt b/Optimus/data/datasets/debug_data/valid.txt new file mode 100755 index 0000000000000000000000000000000000000000..75e7e2c45218297036a535dec270748dbe89d82b --- /dev/null +++ b/Optimus/data/datasets/debug_data/valid.txt @@ -0,0 +1,40 @@ +there is a large group of people at the beach . +a young woman performs an ancient tribal dance . +a little girl in a pink dress sitting in the grass next to her baby doll . +the woman is a ventriloquist . +a woman wearing a blue jeans and light green shirt with scarf is facing a concrete wall with her hands in her pocket . +a woman is embarrassed by the way she looks and covers the man 's eyes . +a family walks down to the park +the room is decorated for a party . +people are standing by the ocean . +a man is eating watermelon . +an african-american man wearing a blank tank top is dancing . +a group of people taking a stroll on the beach at sunset . +a man cutting open a fruit with a large knife . +there is a boy on a swing . +a performer wrapped in a fabric hangs by her ankle from a scaffolding . +the band playing is from a local high school . +a dance teacher and a young student . +a man in a santa hat playing the xylophone . +a couple sits in an antique automobile in a populated building . +there are some construction workers putting up a wall . +three men wearing black jackets are talking to two ladies in tanks tops . +a man is sitting against a pole on the beach while reading a paper . +guys spending the on tv +two boys stand near a drinking fountain as one takes a drink . +a female is getting into a pool . +a white dog retrieves the stick from the lake for her owner . +the photo shows a woman playing drums . +a woman sits on a couch with her feet up and her back to a wood wall with a computer in her lap starring at the screen . +customers wait for food at a busy outdoor restaurant . +passersby holding umbrellas shop for vegetables at a street market . +a poor asian women working at a dam . +a man 's arms putting some paper in a copying machine . +children sitting on a rock getting their photo taken +the man is riding a harley-davidson +people standing on sand looking at clouds . +a dog is running to get the ball . +a woman and her son walk down a long , path , while two people ride their horses in the opposite direction . +to workers mop . +the man is lying on a sofa . +a girl wearing a loose blue shirt and an beanie walks in a lonely wood . diff --git a/Optimus/data/datasets/glue_data/collect_one_glue_data.py b/Optimus/data/datasets/glue_data/collect_one_glue_data.py new file mode 100755 index 0000000000000000000000000000000000000000..6f5f760bba47a0bc74712715bf5842b77dd2082f --- /dev/null +++ b/Optimus/data/datasets/glue_data/collect_one_glue_data.py @@ -0,0 +1,128 @@ +''' +Script for downloading all GLUE data. + +Note: for legal reasons, we are unable to host MRPC. +You can either use the version hosted by the SentEval team, which is already tokenized, +or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually. +For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example). +You should then rename and place specific files in a folder (see below for an example). 
+ +mkdir MRPC +cabextract MSRParaphraseCorpus.msi -d MRPC +cat MRPC/_2DEC3DBE877E4DB192D17C0256E90F1D | tr -d $'\r' > MRPC/msr_paraphrase_train.txt +cat MRPC/_D7B391F9EAFF4B1B8BCE8F21B20B1B61 | tr -d $'\r' > MRPC/msr_paraphrase_test.txt +rm MRPC/_* +rm MSRParaphraseCorpus.msi + +1/30/19: It looks like SentEval is no longer hosting their extracted and tokenized MRPC data, so you'll need to download the data from the original source for now. +2/11/19: It looks like SentEval actually *is* hosting the extracted data. Hooray! +''' + +import os +import sys +import shutil +import argparse +import tempfile +import urllib.request +import zipfile + +TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI" ] + +MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt' +MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt' + +def download_and_extract(task, data_dir): + print("Downloading and extracting %s..." % task) + data_file = "%s.zip" % task + urllib.request.urlretrieve(TASK2PATH[task], data_file) + with zipfile.ZipFile(data_file) as zip_ref: + zip_ref.extractall(data_dir) + os.remove(data_file) + print("\tCompleted!") + +def format_mrpc(data_dir, path_to_data): + print("Processing MRPC...") + mrpc_dir = os.path.join(data_dir, "MRPC") + if not os.path.isdir(mrpc_dir): + os.mkdir(mrpc_dir) + if path_to_data: + mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt") + mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt") + else: + print("Local MRPC data not specified, downloading data from %s" % MRPC_TRAIN) + mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt") + mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt") + urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file) + urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file) + assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file + assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file + urllib.request.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv")) + + dev_ids = [] + with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh: + for row in ids_fh: + dev_ids.append(row.strip().split('\t')) + + with open(mrpc_train_file, encoding="utf8") as data_fh, \ + open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \ + open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh: + header = data_fh.readline() + train_fh.write(header) + dev_fh.write(header) + for row in data_fh: + label, id1, id2, s1, s2 = row.strip().split('\t') + if [id1, id2] in dev_ids: + dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2)) + else: + train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2)) + + with open(mrpc_test_file, encoding="utf8") as data_fh, \ + open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh: + header = data_fh.readline() + test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n") + for idx, row in enumerate(data_fh): + label, id1, id2, s1, s2 = row.strip().split('\t') + test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2)) + print("\tCompleted!") + +def download_diagnostic(data_dir): + print("Downloading and extracting diagnostic...") + if not os.path.isdir(os.path.join(data_dir, "diagnostic")): + os.mkdir(os.path.join(data_dir, "diagnostic")) + data_file = os.path.join(data_dir, "diagnostic", 
"diagnostic.tsv") + urllib.request.urlretrieve(TASK2PATH["diagnostic"], data_file) + print("\tCompleted!") + return + +def get_tasks(task_names): + task_names = task_names.split(',') + if "all" in task_names: + tasks = TASKS + else: + tasks = [] + for task_name in task_names: + assert task_name in TASKS, "Task %s not found!" % task_name + tasks.append(task_name) + return tasks + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument('--data_dir', help='directory to save data to', type=str, default='./') + parser.add_argument('--tasks', help='tasks to download data for as a comma separated string', + type=str, default='all') + parser.add_argument('--path_to_mrpc', help='path to directory containing extracted MRPC data, msr_paraphrase_train.txt and msr_paraphrase_text.txt', + type=str, default='') + args = parser.parse_args(arguments) + + if not os.path.isdir(args.data_dir): + os.mkdir(args.data_dir) + tasks = get_tasks(args.tasks) + + for task in tasks: + extract_and integrate(task, args.data_dir) + + +if __name__ == '__main__': + sys.exit(main(sys.argv[1:])) + + diff --git a/Optimus/data/datasets/glue_data/download_glue_data.py b/Optimus/data/datasets/glue_data/download_glue_data.py new file mode 100755 index 0000000000000000000000000000000000000000..94bacabcde4a0838a65fb6104dbb86de12284ecb --- /dev/null +++ b/Optimus/data/datasets/glue_data/download_glue_data.py @@ -0,0 +1,144 @@ +''' +Script for downloading all GLUE data. + +Note: for legal reasons, we are unable to host MRPC. +You can either use the version hosted by the SentEval team, which is already tokenized, +or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually. +For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example). +You should then rename and place specific files in a folder (see below for an example). + +mkdir MRPC +cabextract MSRParaphraseCorpus.msi -d MRPC +cat MRPC/_2DEC3DBE877E4DB192D17C0256E90F1D | tr -d $'\r' > MRPC/msr_paraphrase_train.txt +cat MRPC/_D7B391F9EAFF4B1B8BCE8F21B20B1B61 | tr -d $'\r' > MRPC/msr_paraphrase_test.txt +rm MRPC/_* +rm MSRParaphraseCorpus.msi + +1/30/19: It looks like SentEval is no longer hosting their extracted and tokenized MRPC data, so you'll need to download the data from the original source for now. +2/11/19: It looks like SentEval actually *is* hosting the extracted data. Hooray! 
+''' + +import os +import sys +import shutil +import argparse +import tempfile +import urllib.request +import zipfile + +TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"] +TASK2PATH = {"CoLA":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4', + "SST":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8', + "MRPC":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc', + "QQP":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5', + "STS":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5', + "MNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce', + "SNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df', + "QNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601', + "RTE":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb', + "WNLI":'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf', + "diagnostic":'https://storage.googleapis.com/mtl-sentence-representations.appspot.com/tsvsWithoutLabels%2FAX.tsv?GoogleAccessId=firebase-adminsdk-0khhl@mtl-sentence-representations.iam.gserviceaccount.com&Expires=2498860800&Signature=DuQ2CSPt2Yfre0C%2BiISrVYrIFaZH1Lc7hBVZDD4ZyR7fZYOMNOUGpi8QxBmTNOrNPjR3z1cggo7WXFfrgECP6FBJSsURv8Ybrue8Ypt%2FTPxbuJ0Xc2FhDi%2BarnecCBFO77RSbfuz%2Bs95hRrYhTnByqu3U%2FYZPaj3tZt5QdfpH2IUROY8LiBXoXS46LE%2FgOQc%2FKN%2BA9SoscRDYsnxHfG0IjXGwHN%2Bf88q6hOmAxeNPx6moDulUF6XMUAaXCSFU%2BnRO2RDL9CapWxj%2BDl7syNyHhB7987hZ80B%2FwFkQ3MEs8auvt5XW1%2Bd4aCU7ytgM69r8JDCwibfhZxpaa4gd50QXQ%3D%3D'} + +MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt' +MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt' + +def download_and_extract(task, data_dir): + print("Downloading and extracting %s..." 
% task) + data_file = "%s.zip" % task + urllib.request.urlretrieve(TASK2PATH[task], data_file) + with zipfile.ZipFile(data_file) as zip_ref: + zip_ref.extractall(data_dir) + os.remove(data_file) + print("\tCompleted!") + +def format_mrpc(data_dir, path_to_data): + print("Processing MRPC...") + mrpc_dir = os.path.join(data_dir, "MRPC") + if not os.path.isdir(mrpc_dir): + os.mkdir(mrpc_dir) + if path_to_data: + mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt") + mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt") + else: + print("Local MRPC data not specified, downloading data from %s" % MRPC_TRAIN) + mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt") + mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt") + urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file) + urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file) + assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file + assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file + urllib.request.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv")) + + dev_ids = [] + with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh: + for row in ids_fh: + dev_ids.append(row.strip().split('\t')) + + with open(mrpc_train_file, encoding="utf8") as data_fh, \ + open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \ + open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh: + header = data_fh.readline() + train_fh.write(header) + dev_fh.write(header) + for row in data_fh: + label, id1, id2, s1, s2 = row.strip().split('\t') + if [id1, id2] in dev_ids: + dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2)) + else: + train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2)) + + with open(mrpc_test_file, encoding="utf8") as data_fh, \ + open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh: + header = data_fh.readline() + test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n") + for idx, row in enumerate(data_fh): + label, id1, id2, s1, s2 = row.strip().split('\t') + test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2)) + print("\tCompleted!") + +def download_diagnostic(data_dir): + print("Downloading and extracting diagnostic...") + if not os.path.isdir(os.path.join(data_dir, "diagnostic")): + os.mkdir(os.path.join(data_dir, "diagnostic")) + data_file = os.path.join(data_dir, "diagnostic", "diagnostic.tsv") + urllib.request.urlretrieve(TASK2PATH["diagnostic"], data_file) + print("\tCompleted!") + return + +def get_tasks(task_names): + task_names = task_names.split(',') + if "all" in task_names: + tasks = TASKS + else: + tasks = [] + for task_name in task_names: + assert task_name in TASKS, "Task %s not found!" 
% task_name + tasks.append(task_name) + return tasks + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument('--data_dir', help='directory to save data to', type=str, default='glue_data') + parser.add_argument('--tasks', help='tasks to download data for as a comma separated string', + type=str, default='all') + parser.add_argument('--path_to_mrpc', help='path to directory containing extracted MRPC data, msr_paraphrase_train.txt and msr_paraphrase_text.txt', + type=str, default='') + args = parser.parse_args(arguments) + + if not os.path.isdir(args.data_dir): + os.mkdir(args.data_dir) + tasks = get_tasks(args.tasks) + + for task in tasks: + if task == 'MRPC': + format_mrpc(args.data_dir, args.path_to_mrpc) + elif task == 'diagnostic': + download_diagnostic(args.data_dir) + else: + download_and_extract(task, args.data_dir) + + +if __name__ == '__main__': + sys.exit(main(sys.argv[1:])) + + diff --git a/Optimus/data/download_datasets.md b/Optimus/data/download_datasets.md new file mode 100644 index 0000000000000000000000000000000000000000..7fc370451d88d7d2eeeb13dd4bdafdb65c290f1e --- /dev/null +++ b/Optimus/data/download_datasets.md @@ -0,0 +1,30 @@ +# Download/Pre-process Datasets + +## Wikipedia + +Option | Files | Size | Data | +| -------- | ------- | -------- | ------- | +|1 | Processed Files in Zip | 11.78G| [Download](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/wikipedia_json_64_filtered.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +|2 | Raw Text | 11.79G| [Download](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/wikipedia.segmented.nltk.txt?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | + + +Our pre-processing protocal: We split the original wiki text into 298 files, and loop over files in one epoch. We filter each sentence in wiki based on two constraints: (1) The sentence length is smaller than 64. (2) The tokenized sentence length is smaller than 256 (so that the encoder can take the entire sentence). To filter the sentence, please change the data folders and run the script: + + sh scripts/scripts_local/run_data_filtering_wiki.sh + +The filtered files are saved in "data/datasets/wikipedia_json_64_filtered". + + +## Fine-tuning datasets + +Language Modeling: Penn, Yelp, Yahoo, Snli. 
A tiny dataset is also provided for the purpose of debugging + +Dataset | Files | +| -------- | ------- | +| Penn | [Zip](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/penn_data.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D)| +| Yelp | [Zip](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/yelp_data.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D)| +| Yahoo | [Zip](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/yahoo_data.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| Snli | [Zip](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/snli_data.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| Debug | [Zip](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/debug_data.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | + + diff --git a/Optimus/doc/env.md b/Optimus/doc/env.md new file mode 100644 index 0000000000000000000000000000000000000000..993f274c7b746860849072fb42557d108a448b74 --- /dev/null +++ b/Optimus/doc/env.md @@ -0,0 +1,22 @@ +# Set up the Environment + +Pull docker from Docker Hub at: `chunyl/pytorch-transformers:v2`, and run it using the following script: + + +``` +SCRIPTPATH="/home/chunyl/azure_mounts/optimus_azure" +IMAGE=chunyl/pytorch-transformers:v2 + +docker run \ +--runtime=nvidia \ +-it --rm \ +--net host \ +--volume $SCRIPTPATH:/workspace \ +--interactive --tty $IMAGE /bin/bash + +``` + + +There is an example at `code/scripts/scripts_docker/run_docker.sh`. 
Please edit the project path to the absolute path on your computer by changing the "SCRIPTPATH", then run the docker at the directory "code":
+
+    sh scripts/scripts_docker/run_docker.sh
diff --git a/Optimus/doc/figs/headfig_optimus.png b/Optimus/doc/figs/headfig_optimus.png
new file mode 100644
index 0000000000000000000000000000000000000000..5c81f218db30695d5a04f5755dd16eb9da036476
Binary files /dev/null and b/Optimus/doc/figs/headfig_optimus.png differ
diff --git a/Optimus/doc/figs/logo_optimus.png b/Optimus/doc/figs/logo_optimus.png
new file mode 100644
index 0000000000000000000000000000000000000000..3c84c2fa1bf2b58499621fbb0ca574dc06402936
Binary files /dev/null and b/Optimus/doc/figs/logo_optimus.png differ
diff --git a/Optimus/doc/figs/optimus_scheme.png b/Optimus/doc/figs/optimus_scheme.png
new file mode 100644
index 0000000000000000000000000000000000000000..8cf5c0e45d89f512c752722477d22c14a3d060d3
Binary files /dev/null and b/Optimus/doc/figs/optimus_scheme.png differ
diff --git a/Optimus/doc/optimius_for_snli.md b/Optimus/doc/optimius_for_snli.md
new file mode 100644
index 0000000000000000000000000000000000000000..44226e5fbf747363aa64a588a141450ccd8987f0
--- /dev/null
+++ b/Optimus/doc/optimius_for_snli.md
@@ -0,0 +1,361 @@
+
+
+## Pre-trained Models for SNLI dataset
+_Note: We provide a series of pre-trained *Optimus* models for different purposes, due to a trade-off between reconstruction capacity and prior regularization._
+
+```bash
+wget https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/$MODEL_DIR/$MODEL_NAME.zip
+unzip $MODEL_NAME.zip -d $MODEL_NAME
+```
+`MODEL_DIR` and `MODEL_NAME` can take different values; we currently release the following models.
+
+
+Play with our [`demo`](http://40.71.23.172:8899/), including sentence interpolation and analogy.
+
+## A model with good latent space manipulation performance on the SNLI dataset
+
+To download a model checkpoint trained with a particular beta value in the VAE objective, use the links below. The checkpoints are hosted on Azure Storage Blob; a SAS with Read permission is used.
+
+| Beta | Checkpoint |
+| -------- | ------- |
+| 1.0 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) |
+| 0.5 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/LM/Snli/768/philly_vae_snli_b0.5_d5_r00.5_ra0.25_length_weighted/checkpoint-31250.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) |
+| 0.0 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/LM/Snli/768/philly_vae_snli_b0.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) |
+
+Each zip file contains three folders: `full`, `encoder` and `decoder`.
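+
+For example, the beta=1.0 checkpoint in the table above corresponds to `MODEL_DIR=output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted` and `MODEL_NAME=checkpoint-31250`. A minimal download sketch is given below (it assumes the SAS query string embedded in the table links above is still valid and simply appends it to the blob URL); the commands that follow then place and unpack the zip under the expected output folder:
+
+```bash
+MODEL_DIR=output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted
+MODEL_NAME=checkpoint-31250
+SAS='?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D'
+wget "https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/$MODEL_DIR/$MODEL_NAME.zip$SAS" -O $MODEL_NAME.zip
+```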
+ +```bash +mkdir -p output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted +mv checkpoint-31250.zip output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted +cd output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted + +unzip checkpoint-31250.zip +``` + + + +### Play with user input sentences + +The main training script is [`run_latent_generation.py`](../code/examples/big_ae/run_latent_generation.py) and conducts the fine-tuning loop, taking the following options (among others) as arguments: + +- `--interact_with_user_input`: it specifies the program will take user inputs +- `--play_mode`: Two modes are supported: [`analogy`, `interpolation`] +- `--sent_source` and `--sent_target`: the source and target sentences to interpolate in between, or to make an analogy +- `--num_interpolation_steps`: the number of interpolated sentences between source and target sentences +- `--sent_input`: the input sentence that will be re-written with the analogy specified by the source and target sentences +- `--degree_to_target`: (float type), the degree to which the analogy will made, default value is 1.0. + +Here are two examples: + +``` +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=1 + +# analogy +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ + --dataset Debug \ + --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 31250 \ + --block_size 100 \ + --max_seq_length 100 \ + --latent_size 768 \ + --interact_with_user_input \ + --play_mode analogy \ + --sent_source="a yellow cat likes to chase a long string ." \ + --sent_target="a yellow cat likes to chase a short string ." \ + --sent_input="a brown dog likes to eat long pasta ." \ + --degree_to_target=1.0 + +# interpolation +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ + --dataset Debug \ + --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 31250 \ + --block_size 100 \ + --max_seq_length 100 \ + --latent_size 768 \ + --interact_with_user_input \ + --play_mode interpolation \ + --sent_source="a yellow cat likes to chase a short string ." \ + --sent_target="a brown dog likes to eat his food very slowly ." 
\ + --num_interpolation_steps=10 + +``` +_Acknowledgement: the user interaction mode is updated with the suggestion from [summerstay](https://github.com/summerstay), in an issue [thread](https://github.com/ChunyuanLI/Optimus/issues/4)_ + +### Play with the my debugging dataset, without user inputs + +Interpolation + +```bash +# interpolation + +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export TRAIN_FILE=../data/datasets/debug_data/train.txt +export TEST_FILE=../data/datasets/debug_data/test.txt +export GPU_ID=1 + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ + --dataset Debug \ + --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --output_dir=../output/LM/Snli/768/philly_vae_snli_b1.0_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 31250 \ + --block_size 100 \ + --max_seq_length 100 \ + --latent_size 768 \ + --play_mode interpolation \ + --num_interpolation_steps 10 + +``` + + + +Reconstruction + +```bash +# reconstrction +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_latent_generation.py \ + --dataset Debug \ + --checkpoint_dir=../output/LM/Snli/768/philly_vae_snli_b0.5_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --output_dir=../output/LM/Snli/768/philly_vae_snli_b0.5_d5_r00.5_ra0.25_length_weighted/checkpoint-31250 \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --per_gpu_eval_batch_size=1 \ + --gloabl_step_eval 31250 \ + --block_size 100 \ + --max_seq_length 100 \ + --latent_size 768 \ + --play_mode reconstrction +``` + +Please see the scripts I used to run the evaluation at [code/scripts/scripts_local/eval_optimus_latent_space.sh](../code/scripts/scripts_local/eval_optimus_latent_space.sh). Here are some results you can see from the model: + + + +#### When beta changes from 0 to 1, the reconstruction quality become worse + +Reconstruction (beta = 0.0) +``` +a football coach putting his arm on one of his player's shoulder. + a football player putting his arm on one of his coach's shoulder. + + a girl wearing a white shirt and blue jeans jumping off a rock into the sand. + a girl wearing a white shirt and blue jeans jumping off a rock into the sand. + + a group of cheerleaders are making a human pyramid at a basketball game. + a group of cheerleaders are making a human pyramid at a basketball game. + + a man is looking over things that are in a local market. + a man is looking over things that are in a local market. + + a woman is riding a moped on a street with many other people riding mopeds behind her, while streamers and banners hang in the trees overhead. + a woman is riding a moped on a street with large trees and other speakers as she rides in front of a turntable carrying camels in the background. + + group of people in the wilderness packing boxes full of food. + group of people in the wilderness packing boxes full of food. + + man tries to impress girl by diving into the water. + man tries to impress girl by diving into the water. + + people in the middle of city street surrounded by large buildings. 
+ people in the middle of city street surrounded by large buildings. + + there are girls that are on the ice practicing their ice dancing. + there are girls that are on the ice practicing their ice dancing. + + there are two young boys with shovels who are outside and are digging in the dirt. + there are two young boys with shovels and are outside being chased by the dirt. + + two men in blue one standing the other hanging onto the window of a tram car just looking out. + two men in blue standing over the window one of them wearing a pink bodysuit riding the other down the street. + +``` + +Reconstruction (beta = 0.5) +``` +a football coach putting his arm on one of his player's shoulder. + a football player putting his arm on another's shoulder, this one hispanic. + + a girl wearing a white shirt and blue jeans jumping off a rock into the sand. + a girl wearing a blue shirt and white jeans jumping off a rock into the sand. + + a group of cheerleaders are making a human pyramid at a basketball game. + a group of cheerleaders are making a human pyramid at a basketball game. + + a man is looking over things that are in a local market. + a man is looking over things that are in a local market. + + a woman is riding a moped on a street with many other people riding mopeds behind her, while streamers and banners hang in the trees overhead. + a woman is riding a moped on a street with many other people behind her, as well as small banners and horns riding in the background. + + group of people in the wilderness packing boxes full of food. + group of people in the wilderness packing boxes full of food. + + man tries to impress girl by diving into the water. + man tries to impress girl by diving into the water. + + people in the middle of city street surrounded by large buildings. + people in the middle of city streets surrounded by large buildings. + + there are girls that are on the ice practicing their ice dancing. + there are girls that are on the ice practicing their ice dancing. + + there are two young boys with shovels who are outside and are digging in the dirt. + there are two young boys with shovels who are outside and are digging in the dirt. + + two men in blue one standing the other hanging onto the window of a tram car just looking out. + two men in all blue standing the windowless car next to another man riding a blue roller coaster. +``` + + + +Reconstruction (beta = 1.0) +``` +a football coach putting his arm on one of his player's shoulder. + a football player extending his hand on the team football. + + a girl wearing a white shirt and blue jeans jumping off a rock into the sand. + a girl wearing a blue shirt and blue jeans jumping off the rock into the ocean. + + a group of cheerleaders are making a human pyramid at a basketball game. + a group of girls are creating a purple basketball at a giant auditorium. + + a man is looking over things that are in a local market. + a man is looking over things in a local marketplace, looking very busy. + + a woman is riding a moped on a street with many other people riding mopeds behind her, while streamers and banners hang in the trees overhead. + a woman is riding a bike on a grassy lot with other people, riding flag poles and asian flags in front of it. + + group of people in the wilderness packing boxes full of food. + group of people in the packing room packing stuff. + + man tries to impress girl by diving into the water. + man tries to impress the girl by swimming underwater. 
+ + people in the middle of city street surrounded by large buildings. + people in high, urban city blocks surrounding the building. + + there are girls that are on the ice practicing their ice dancing. + there are two girls who are practicing the ice skating in the snow. + + there are two young boys with shovels who are outside and are digging in the dirt. + there are two young children with shovels and shovels in the dirt. + + two men in blue one standing the other hanging onto the window of a tram car just looking out. + two men in yellow overalls standing next to the wheel of a blue car while they look on. + +``` + + + + +### When beta changes from 0 to 1, similar interpolation quality are observed: + +Interpolation (beta = 0.0) +``` +0 + a woman is riding a moped on a street with large trees and other speakers as she rides in front of a turntable carrying camels in the background. +1 + a woman is riding a moped on the street with several other people riding mopeds and lights behind him, while passersby in the background draw pictures. +2 + a woman riding a scooter is riding on two large streets behind her, one with a woman singing in the background behind them. +3 + a woman on a pony is riding over the street holding several mopeds and wagons in front of a large, white painted building as other people are watching. +4 + a man with two ponytails riding on the street is riding down an empty stage as another person in a white hooded jacket watches. +5 + one man in a blue tuxedo is riding over the street holding two people on it as they ride a light brown horse. +6 + one man in a blue hoodie riding a cart is standing over the street while others view it on the other side. +7 + two men on one side of the street holding a blue balloon as they ride wagons moving past the building. +8 + two men in blue sitting on the roof of a car that is blowing up another one leaning very close. +9 + two men in yellow one standing on the window holding a blue car trying to ride it down the street. +10 + two men in the blue one window holding onto the car are jumping over another man walking down the street. +``` + +Interpolation (beta = 0.5) +``` +0 + a woman is riding a moped on a street with many other people behind her, as well as small banners and horns riding in the background. +1 + a woman is riding a moped with several people on it behind her, riding a straw pole and streets in the background. +2 + a woman riding a trolley is riding on a street with many people in front of them, as well as ripples surrounding it. +3 + a woman riding a moped has two others standing in the street on side a bus as they weave, blowing bubbles on it. +4 + one man riding a black stroller is riding on the street beside a man with painted windows and others populating in the background. +5 + one man riding a pink bus is standing on the street behind another man making wheelie figures and the window in between them. +6 + two men in a blue hoodie standing on one of the cars drives past people hanging a wicker window of the street. +7 + two men in blue holding a wheelie standing on the street that are both winking into the windows next to them. +8 + two men in pink standing on the street one of which is pulling a blue parasol to window it. +9 + two men in the blue one person standing on a car leaning over it window dreaming of the other floating. +10 + two men in blue that was standing next to the window holding one wheelie riding a black bike down the street. 
+ +``` + +Interpolation (beta = 1.0) +``` +0 +a woman is riding a moped on a street with large trees and other riders ride over it as some sort of bob lights are passing behind her. +1 + a woman is riding a moped on a street with several buildings that run in front of and polluting the behind her ears. +2 + a woman is riding a small pig in front of a street with billows and speakers that lead to the windowspan on them. +3 + a woman on a street riding mule is holding two wheels in front of a large white tulips while it sews the air behind them. +4 + a man in a ponytail is riding two of the houses visible on the street while hanging traffic cones over them. +5 + one man riding a van in front of the street is shining blue curtains, while another man holding on to the moped wires. +6 + one man in a blue shirt and black riding a windowless cart are riding over the others next to them hanging on the river. +7 + two men in yellow riding a wave just standing on the side of the building keeping track of another man. +8 + two men in matching blue shirt standing on the roof of a car one it reading a trolley car. +9 + two men in the blue one car standing on the window dreaming of hanging another car passing by. +10 + two men in blue holding each other standing the window of a one wheel bicycle trying out the tube. + +``` diff --git a/Optimus/doc/optimus_finetune_language_models.md b/Optimus/doc/optimus_finetune_language_models.md new file mode 100644 index 0000000000000000000000000000000000000000..a3089a63f97f295dd1c14bb44b8895b9fdf4448d --- /dev/null +++ b/Optimus/doc/optimus_finetune_language_models.md @@ -0,0 +1,52 @@ +# Fine-tuning Optimus on a VAE language modeling task + +_Note: The latent vector size has a great impact on the model performance: small latent size provides a tight information bottleneck, often yielding low reconstruction quality; In contrast, large latent size shows good reconstruction quality, but would be hard to control the latent manipulation with vector operators if the latent size is too large. To have a fair comparison with existing works, we use a 32-dimensional latent vector for the experiments on language modeling._ + +##### Download a pre-trained model (pre-trained from Wikipedia). 
+ + +| Beta | Latent size | Checkpoint | +| -------- | ------- | ------- | +| 0.0 | 32 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| 0.5 | 32 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| 0.0 | 768 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_768_v2/checkpoint-508523.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| 0.5 | 768 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_768_v2/checkpoint-508523.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | +| 1.0 | 768 | [Checkpoint](https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta1.0_d1.0_ro0.5_ra0.25_768_v2/checkpoint-508523.zip?sp=r&st=2023-08-28T00:40:43Z&se=3023-08-28T08:40:43Z&sv=2022-11-02&sr=c&sig=kUkSFqeHFfTeqxxpvqVdICCJupwODFwJprCAW2o4irE%3D) | + + + +``` +export PYTHONPATH="${PYTHONPATH}:/workspace/code" +export GPU_ID=0,1 + +export TRAIN_FILE=../data/datasets/snli_data/train.txt +export TEST_FILE=../data/datasets/snli_data/test.txt + +CUDA_VISIBLE_DEVICES=$GPU_ID python examples/big_ae/run_lm_vae_training.py \ + --output_dir=../output/LM/Snli/local_lm_vae_snli_optimus \ + --dataset Snli \ + --encoder_model_type=bert \ + --encoder_model_name_or_path=bert-base-cased \ + --decoder_model_type=gpt2 \ + --decoder_model_name_or_path=gpt2 \ + --beta 1.0 \ + --ratio_zero 0.5 \ + --ratio_increase 0.25 \ + --do_train \ + --do_eval \ + --fb_mode 1 \ + --dim_target_kl 0.5\ + --train_data_file=$TRAIN_FILE \ + --eval_data_file=$TEST_FILE \ + --num_train_epochs 1.0 \ + --save_steps 1000 \ + --logging_steps 1000 \ + --overwrite_output_dir \ + --per_gpu_train_batch_size=5 \ + --block_size 100 \ + --length_weighted_loss \ + --use_pretrained_model \ + --use_pretrained_vae \ + --checkpoint_dir ../output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523 \ + --gloabl_step_eval 508523 +``` diff --git a/Optimus/download_datasets.md b/Optimus/download_datasets.md new file mode 100644 index 0000000000000000000000000000000000000000..382b42b98c5a1cb2cf0043272d041af654d7a7ff --- /dev/null +++ b/Optimus/download_datasets.md @@ -0,0 +1,40 @@ +# Download/Pre-process Datasets + +## Wikipedia + +Download processed files (11.78G) below, and unzip it (298 files) + +https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/wikipedia_json_64_filtered.zip + +Download raw file (11.79G): + 
+https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/wikipedia.segmented.nltk.txt + +Our pre-processing protocal: We split the original wiki text into 298 files, and loop over files in one epoch. + +We filter each sentence in wiki based on two constraints: (1) The sentence length is smaller than 64. (2) The tokenized sentence length is smaller than 256 (so that the encoder can take the entire sentence). + +To filter the sentence, please change the data folders and run the script: + + sh scripts/scripts_local/run_data_filtering_wiki.sh + +The filtered files are saved in "data/datasets/wikipedia_json_64_filtered". + + +## Fine-tuning datasets + +Language Modeling: Penn, Yelp, Yahoo, Snli + +https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/penn_data.zip +https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/datasets/yelp_data.zip +https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/yahoo_data.zip +https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/data/datasets/snli_data.zip + +(Stylized) Dialog response generation: DailyDialog, Holmes + + +Label-conditional text generation: Yelp. + + +Language Understanding: GLUE, Yelp. + diff --git a/app.py b/app.py new file mode 100644 index 0000000000000000000000000000000000000000..857e5f71843a606b0121d4b9302c85e00fdd06d3 --- /dev/null +++ b/app.py @@ -0,0 +1,342 @@ +# -*- coding: utf-8 -*- +"""message_bottle.ipynb + +Automatically generated by Colab. + +Original file is located at + https://colab.research.google.com/drive/1I47sLakpuwERGzn-XoNct67mwiDS1mQD +""" + +import torch +import torch.nn as nn +import torch.nn.functional as F +torch.set_float32_matmul_precision('high') + +from tqdm import tqdm +from transformers import AutoTokenizer, AutoModelForCausalLM + +class BottleneckT5Autoencoder: + def __init__(self, model_path: str, device='cuda'): + self.device = device + self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512, torch_dtype=torch.bfloat16) + self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device) + self.model.eval() + # self.model = torch.compile(self.model) + + + def embed(self, text: str) -> torch.FloatTensor: + inputs = self.tokenizer(text, return_tensors='pt', padding=True).to(self.device) + decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device) + return self.model( + **inputs, + decoder_input_ids=decoder_inputs['input_ids'], + encode_only=True, + ) + + def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1., top_p=.8, length_penalty=10, min_new_tokens=30) -> str: + dummy_text = '.' 
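+        # Conditioning trick: embed a dummy '.' prompt, compute the offset (latent - dummy),
+        # and stash it on the model as `perturb_vector`; the custom remote-code model (loaded
+        # with trust_remote_code=True) is expected to apply this offset during generate(), so
+        # decoding the dummy prompt below is steered by the target latent rather than by any
+        # textual input.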
+ dummy = self.embed(dummy_text) + perturb_vector = latent - dummy + self.model.perturb_vector = perturb_vector + input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids + output = self.model.generate( + input_ids=input_ids, + max_length=max_length, + do_sample=True, + temperature=temperature, + top_p=top_p, + num_return_sequences=1, + length_penalty=length_penalty, + min_new_tokens=min_new_tokens, + # num_beams=8, + ) + return self.tokenizer.decode(output[0], skip_special_tokens=True) + +autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-xl-wikipedia') + + +import gradio as gr +import numpy as np +from sklearn.svm import SVC +from sklearn.inspection import permutation_importance +from sklearn import preprocessing +import pandas as pd +import random +import time + + +dtype = torch.bfloat16 +torch.set_grad_enabled(False) + +prompt_list = [p for p in list(set( + pd.read_csv('./twitter_prompts.csv').iloc[:, 1].tolist())) if type(p) == str] + +start_time = time.time() + +####################### Setup Model + +# TODO put back +# @spaces.GPU() +def generate(prompt, in_embs=None,): + if prompt != '': + print(prompt) + in_embs = in_embs / in_embs.abs().max() * .15 if in_embs != None else None + in_embs = .9 * in_embs.to('cuda') + .5 * autoencoder.embed(prompt).to('cuda') if in_embs != None else autoencoder.embed(prompt).to('cuda') + else: + print('From embeds.') + in_embs = in_embs / in_embs.abs().max() * .15 + text = autoencoder.generate_from_latent(in_embs.to('cuda'), temperature=.3, top_p=.99, min_new_tokens=5) + in_embs = autoencoder.embed(prompt) + return text, in_embs.to('cpu') + + +####################### + +# TODO add to state instead of shared across all +glob_idx = 0 + +def next_one(embs, ys, calibrate_prompts): + global glob_idx + glob_idx = glob_idx + 1 + + with torch.no_grad(): + if len(calibrate_prompts) > 0: + print('######### Calibrating with sample prompts #########') + prompt = calibrate_prompts.pop(0) + print(prompt) + text, img_embs = generate(prompt) + embs += img_embs + print(len(embs)) + return text, embs, ys, calibrate_prompts + else: + print('######### Roaming #########') + + + # handle case where every instance of calibration prompts is 'Neither' or 'Like' or 'Dislike' + if len(list(set(ys))) <= 1: + embs.append(.01*torch.randn(2048)) + embs.append(.01*torch.randn(2048)) + ys.append(0) + ys.append(1) + if len(list(ys)) < 10: + embs += [.01*torch.randn(2048)] * 3 + ys += [0] * 3 + + pos_indices = [i for i in range(len(embs)) if ys[i] == 1] + neg_indices = [i for i in range(len(embs)) if ys[i] == 0] + + # the embs & ys stay tied by index but we shuffle to drop randomly + random.shuffle(pos_indices) + random.shuffle(neg_indices) + + #if len(pos_indices) - len(neg_indices) > 48 and len(pos_indices) > 80: + # pos_indices = pos_indices[32:] + if len(neg_indices) - len(pos_indices) > 48/16 and len(pos_indices) > 6: + pos_indices = pos_indices[5:] + if len(neg_indices) - len(pos_indices) > 48/16 and len(neg_indices) > 6: + neg_indices = neg_indices[5:] + + + if len(neg_indices) > 25: + neg_indices = neg_indices[1:] + + print(len(pos_indices), len(neg_indices)) + indices = pos_indices + neg_indices + + embs = [embs[i] for i in indices] + ys = [ys[i] for i in indices] + + + indices = list(range(len(embs))) + + # also add the latest 0 and the latest 1 + has_0 = False + has_1 = False + for i in reversed(range(len(ys))): + if ys[i] == 0 and has_0 == False: + indices.append(i) + has_0 = True + elif ys[i] == 1 and has_1 == 
False: + indices.append(i) + has_1 = True + if has_0 and has_1: + break + + # we may have just encountered a rare multi-threading diffusers issue (https://github.com/huggingface/diffusers/issues/5749); + # this ends up adding a rating but losing an embedding, it seems. + # let's take off a rating if so to continue without indexing errors. + if len(ys) > len(embs): + print('ys are longer than embs; popping latest rating') + ys.pop(-1) + + feature_embs = np.array(torch.stack([embs[i].to('cpu') for i in indices]).to('cpu')) + scaler = preprocessing.StandardScaler().fit(feature_embs) + feature_embs = scaler.transform(feature_embs) + chosen_y = np.array([ys[i] for i in indices]) + + print('Gathering coefficients') + lin_class = SVC(max_iter=50000, kernel='linear', class_weight='balanced', C=.1).fit(feature_embs, chosen_y) + coef_ = torch.tensor(lin_class.coef_, dtype=torch.double) + coef_ = coef_ / coef_.abs().max() * 3 + print(coef_.shape, 'COEF') + print('Gathered') + + rng_prompt = random.choice(prompt_list) + w = 1# if len(embs) % 2 == 0 else 0 + im_emb = w * coef_.to(dtype=dtype) + + prompt= '' if glob_idx % 3 != 0 else rng_prompt + text, im_emb = generate(prompt, im_emb) + embs += im_emb + + + return text, embs, ys, calibrate_prompts + + + + + + + + + +def start(_, embs, ys, calibrate_prompts): + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return [ + gr.Button(value='Like (L)', interactive=True), + gr.Button(value='Neither (Space)', interactive=True), + gr.Button(value='Dislike (A)', interactive=True), + gr.Button(value='Start', interactive=False), + text, + embs, + ys, + calibrate_prompts + ] + + +def choose(text, choice, embs, ys, calibrate_prompts): + if choice == 'Like (L)': + choice = 1 + elif choice == 'Neither (Space)': + embs = embs[:-1] + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return text, embs, ys, calibrate_prompts + else: + choice = 0 + + # if we detected NSFW, leave that area of latent space regardless of how they rated chosen. + # TODO skip allowing rating + if text == None: + print('NSFW -- choice is disliked') + choice = 0 + + ys += [choice]*1 + text, embs, ys, calibrate_prompts = next_one(embs, ys, calibrate_prompts) + return text, embs, ys, calibrate_prompts + +css = '''.gradio-container{max-width: 700px !important} +#description{text-align: center} +#description h1, #description h3{display: block} +#description p{margin-top: 0} +.fade-in-out {animation: fadeInOut 3s forwards} +@keyframes fadeInOut { + 0% { + background: var(--bg-color); + } + 100% { + background: var(--button-secondary-background-fill); + } +} +''' +js_head = ''' + +''' + +with gr.Blocks(css=css, head=js_head) as demo: + gr.Markdown('''# Compass +### Generative Recommenders for Exporation of Text + +Explore the latent space without prompting based on your preferences. Learn more on [the write-up](https://rynmurdock.github.io/posts/2024/3/generative_recomenders/). + ''', elem_id="description") + embs = gr.State([]) + ys = gr.State([]) + calibrate_prompts = gr.State([ + 'the moon is melting into my glass of tea', + 'a sea slug -- pair of claws scuttling -- jelly fish glowing', + 'an adorable creature. 
It may be a goblin or a pig or a slug.', + 'an animation about a gorgeous nebula', + 'a sketch of an impressive mountain by da vinci', + 'a watercolor painting: the octopus writhes', + ]) + def l(): + return None + + with gr.Row(elem_id='output-image'): + text = gr.Textbox(interactive=False, elem_id="text") + with gr.Row(equal_height=True): + b3 = gr.Button(value='Dislike (A)', interactive=False, elem_id="dislike") + b2 = gr.Button(value='Neither (Space)', interactive=False, elem_id="neither") + b1 = gr.Button(value='Like (L)', interactive=False, elem_id="like") + b1.click( + choose, + [text, b1, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + b2.click( + choose, + [text, b2, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + b3.click( + choose, + [text, b3, embs, ys, calibrate_prompts], + [text, embs, ys, calibrate_prompts] + ) + with gr.Row(): + b4 = gr.Button(value='Start') + b4.click(start, + [b4, embs, ys, calibrate_prompts], + [b1, b2, b3, b4, text, embs, ys, calibrate_prompts]) + with gr.Row(): + html = gr.HTML('''
You will calibrate for several prompts and then roam.


+
Note that while the model is unlikely to produce NSFW text, this may still occur, and users should avoid NSFW content when rating. + +

+
Thanks to @multimodalart for their contributions to the demo, esp. the interface and @maxbittker for feedback. +''') + +demo.launch(share=True) diff --git a/checkpoint-31250/checkpoint-decoder-31250/config.json b/checkpoint-31250/checkpoint-decoder-31250/config.json new file mode 100755 index 0000000000000000000000000000000000000000..3b0aad099b6837e2f7cad437da5da6de6424f6ad --- /dev/null +++ b/checkpoint-31250/checkpoint-decoder-31250/config.json @@ -0,0 +1,28 @@ +{ + "architectures": [ + "GPT2LMHeadModel" + ], + "attn_pdrop": 0.1, + "embd_pdrop": 0.1, + "finetuning_task": null, + "initializer_range": 0.02, + "latent_size": 768, + "layer_norm_epsilon": 1e-05, + "n_ctx": 1024, + "n_embd": 768, + "n_head": 12, + "n_layer": 12, + "n_positions": 1024, + "num_labels": 1, + "output_attentions": false, + "output_hidden_states": false, + "pruned_heads": {}, + "resid_pdrop": 0.1, + "summary_activation": null, + "summary_first_dropout": 0.1, + "summary_proj_to_labels": true, + "summary_type": "cls_index", + "summary_use_proj": true, + "torchscript": false, + "vocab_size": 50260 +} diff --git a/checkpoint-31250/checkpoint-decoder-31250/pytorch_model.bin b/checkpoint-31250/checkpoint-decoder-31250/pytorch_model.bin new file mode 100755 index 0000000000000000000000000000000000000000..5923dae6b9f11890d8cbba6cc26caf661c8f3683 --- /dev/null +++ b/checkpoint-31250/checkpoint-decoder-31250/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:44191a9d774bb47ee02b1fbe38769fc4a68f25e373ef13c7b14b0fb7a721c8ed +size 578805986 diff --git a/checkpoint-31250/checkpoint-decoder-31250/training_decoder_args.bin b/checkpoint-31250/checkpoint-decoder-31250/training_decoder_args.bin new file mode 100755 index 0000000000000000000000000000000000000000..d4886c732aba1747a4c9eb25aaae4d53577ef368 --- /dev/null +++ b/checkpoint-31250/checkpoint-decoder-31250/training_decoder_args.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1a79ba5f6ca41deaf9b59596e4f06f9969dd15d3b887a93c03d7c3cb5584e00a +size 2338 diff --git a/checkpoint-31250/checkpoint-encoder-31250/config.json b/checkpoint-31250/checkpoint-encoder-31250/config.json new file mode 100755 index 0000000000000000000000000000000000000000..03cdc9bf6b5ae24b254dd2e85d9acfae675bd3df --- /dev/null +++ b/checkpoint-31250/checkpoint-encoder-31250/config.json @@ -0,0 +1,23 @@ +{ + "architectures": [ + "BertForMaskedLM" + ], + "attention_probs_dropout_prob": 0.1, + "finetuning_task": null, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "layer_norm_eps": 1e-12, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "num_labels": 2, + "output_attentions": false, + "output_hidden_states": false, + "pruned_heads": {}, + "torchscript": false, + "type_vocab_size": 2, + "vocab_size": 28996 +} diff --git a/checkpoint-31250/checkpoint-encoder-31250/pytorch_model.bin b/checkpoint-31250/checkpoint-encoder-31250/pytorch_model.bin new file mode 100755 index 0000000000000000000000000000000000000000..7075e822be053c5e7d5cde45d43e9db054c6b01d --- /dev/null +++ b/checkpoint-31250/checkpoint-encoder-31250/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:20cbd92cce963a406748b2b0532c1a00f2f8bb12da9340f5978640bbf24b42e2 +size 438007669 diff --git a/checkpoint-31250/checkpoint-encoder-31250/training_encoder_args.bin 
b/checkpoint-31250/checkpoint-encoder-31250/training_encoder_args.bin new file mode 100755 index 0000000000000000000000000000000000000000..d4886c732aba1747a4c9eb25aaae4d53577ef368 --- /dev/null +++ b/checkpoint-31250/checkpoint-encoder-31250/training_encoder_args.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1a79ba5f6ca41deaf9b59596e4f06f9969dd15d3b887a93c03d7c3cb5584e00a +size 2338 diff --git a/checkpoint-31250/checkpoint-full-31250/training.bin b/checkpoint-31250/checkpoint-full-31250/training.bin new file mode 100755 index 0000000000000000000000000000000000000000..50cda4a6ade2fd4f2809d5dc8691a7f33ab47d65 --- /dev/null +++ b/checkpoint-31250/checkpoint-full-31250/training.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5a5bfb0a931df6f22904c32a93af0c508a3af501c0410f6a3a0e69a711814e33 +size 2949730416 diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6c90367e9800dd36da333be580592cb00639fd4 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,13 @@ +gradio +numpy +scikit-learn +pandas +torch +numpy +diffusers +accelerate +transformers +sentencepiece +peft +tensorflow_hub +tensorflow==2.14.0 diff --git a/twitter_prompts.csv b/twitter_prompts.csv new file mode 100644 index 0000000000000000000000000000000000000000..569a05484f4782ecceb5c9a988fd7759aa6e9929 --- /dev/null +++ b/twitter_prompts.csv @@ -0,0 +1,2088 @@ +,0 +0,Persephone +1,"A portrait: man, whose lineage is corpse." +2,a beautiful Waluigi +3,president abe lincoln but a cat +4,a woman and a crow +5,"A professional, minimalist poster for the book The Old Man and the Sea" +6,"half Ryan, half pigeon" +7,Easter cat +8,a beautiful woman +9,a cherry tree made of fractals +10,a christmas card from the victorian era +11,The Theotokos is a bird +12, +13,A short life full of immense joy +14,a character from a ghibli movie +15,A structure made of people standing on top of other people +16,зеленая собака +17,a painting of the city +18,a character from a ghibli movie +19,pasta ömetabolism +20,"a brilliant sketch titled ""Let Forever be Delayed""" +21,the sun is shining on the lake +22,Monet Lisa +23,Genesis +24,Synesthesia +25,A dead man +26,a cherry tree made of fractals +27,a tasteful nude +28,The First Supper +29,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +30,a painting of the last day +31,Dead Codes by Ryan Murdock +32,Genesis +33,symmetry +34,The OLD DATA +35,a beautiful person +36,the whitest man +37,Death is a black camel that kneels down so we can ride +38,a goblin by van gogh +39,a portrait of a beautiful person +40,a famous painted portrait of Lady Macbeth +41,on the edge of grace +42,"""A God Made of Wires and Dust"" by Ryan Murdock" +43,symmetry +44,a beautiful person +45,"If we're not careful, it's only art about not-quite-dead pigs from now on." +46,Beauty here -- a photograph by Ryan Murdock +47,Hunger art by r.j. Murdock +48,"A professional, minimalist poster for the film Donnie Darko" +49,A black and white photo of a rainbow. +50,a beautiful painting +51,Monet Lisa +52,a painting of the city +53,A structure made of people standing on top of other people +54,a criminal +55,a cherry tree made of fractals +56,Persephone flees Hades +57,a tree with weaping branches +58,a tree with weaping branches +59,Genesis +60,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +61,a cute cat +62,Aflame +63,A cat wearing a tophat +64,a terrifying night hag +65,a beautiful woman +66,Fire +67,a cherry tree made of fractals +68,The EcoCathedral +69,a man on fire +70,A structure made of people standing on top of other people +71,totemic dusk +72,The Death of Achilles +73,Everywhere is no-place +74,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +75,An Arundel Tomb +76,The average Advadnoun twitter follower +77,I can read when there's writing on the wall +78, +79,A Tragedy +80,Breathe deep the fumes at Delphi +81,a pOrTRaIT Of tHe SpOngeBOb CHicKen +82,a portrait of a beautiful person +83,a beautiful person +84,a portrait of a beautiful person +85,Dead Codes by Ryan Murdock +86,a photo of a purple dog +87,Memento Mori +88,"joy, happiness, bliss" +89,Paradise Lost +90,a beautiful person +91,melancholia +92,Monet Lisa +93,"Of that which one cannot speak, one must be silent." +94, +95,Juliet +96,God killed Van Gogh. +97,a cherry tree made of fractals +98,a horse with four eyes. +99,a beautiful person +100,With the Gods in envy of their visions +101,The Lost Generation +102,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +103,a portrait of a beautiful person +104,"half Ryan, half pigeon" +105,a ginormous baby +106,a wormhole +107,Ophelia +108,"""The hunger artist, full"" by Ryan Murdock" +109,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +110,"Intricate, Weeping Tree by Ryan Murdock" +111,everything was beautiful and nothing hurt +112,Saturn being a good dad to his son +113,The years gild our memoriesnUnfairly. +114,Intimations of Immortality +115,meaningless neko ♡♡ neko +116,chiaroscuro +117,The Patron Saint of Evil +118,a portrait of a beautiful person +119,"Mephisto, shrouded in smoke" +120,everything was beautiful and nothing hurt +121,God killed Van Gogh. +122,a man wearing makeup +123,Everywhere is no-place +124,🔴~__��'t � +125,a beautiful waluigi +126,a beautiful woman +127,a portrait of a beautiful person +128,/ +129,a green doG +130,Dead Codes by Ryan Murdock +131,I miss the Spring +132, +133,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +134,a woman and a crow +135,a photo from {my hometown} +136,Summer's Symphony: Counterpoint and Melody +137,a cute cat +138,"God, it's amazing." +139,a painting of a sycamore in +140,distinguished leaves decorated +141,I do not think they'll sing for me +142,the monet lisa +143,a portrait of Abraham Lincoln +144,The average Advadnoun twitter follower +145,Dancing in the moonlight +146,Shinji Ikari +147,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +148,/ +149,is this loss? but it's van gogh +150,Shinji Ikari +151,a portrait of Juliet +152,A sticky-note magnum opus featuring birds +153,a silent palace +154,"""A new hope blooms on the long notes of old horns.""" +155,The things I'll take with me +156,is this loss? but it's van gogh +157,a beautiful haunting +158,Summer's Symphony: Counterpoint and Melody +159,зеленая собака +160,Last Breath +161,Last Breath +162,a cherry tree made of fractals +163,The Theotokos is a bird +164,a man holding an apple in one hand +165,a beautiful person +166,Monet Lisa +167,A baroque portrait of Hamlet +168,A gun killed Van Gogh. 
+169,totemic dusk +170,a portrait of a beautiful person +171,pasta ömetabolism +172,a beautiful person +173,Taylor Swift +174,colorful rabbits chandelier polaroid +175,Dancing in the moonlight +176,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +177,symmetry +178,"""Your mind flls in the gaps"" - by Ryan Murdock" +179,the moon is a sickle cell +180,"joy, happiness, bliss" +181,Beauty here -- a photograph by Ryan Murdock +182,a beautiful person +183,a photo of a purple dog +184,A propaganda poster promoting big chungus +185,a beautiful person +186,a tree with weaping branches +187,A gun killed Van Gogh. +188,"""A new hope blooms on the long notes of old horns.""" +189,a portrait of Abe Lincoln +190,"""I love you more than the world can contain in its lonely and ramshackle head.""" +191,a character from a ghibli movie +192,f*** it market standard rule language – distinguish np tax science research +193,a portrait of Abe Lincoln +194,a wholesome clown. Not creepy at all +195, +196,a corgi +197,Easter cat +198,a portrait of Abraham Lincoln +199,a person's face +200,A poster advertising Freudian Psychoanalytics +201,Dancing in the moonlight +202,Cat in a teacup +203,a beautiful person +204,Summer's Symphony: Counterpoint and Melody +205,Post-Modern Nouveaux Statue +206,a famous painted portrait of Lady Macbeth +207,photosynthesis +208,a photo of a purple dog +209, +210,a photo of Juliet +211,The Starry Night +212,Saturn being a good dad to his son +213,a beautiful person +214,In smoke and mould the fleshless dead +215,totemic dusk +216,a beautiful woman +217,God killed Van Gogh. +218,is this loss? but it's van gogh +219,Nostos +220,a silent palace +221,"""The hunger artist, full"" by Ryan Murdock" +222,a green doG +223,Weeping Roses +224,for sale: baby shoes; never worn +225,a dog eating a cheese burger +226,a man inside a cage +227,Contentment at the Disco +228,a photo from {my hometown} +229,The EcoCathedral +230,The OLD DATA +231,treehouse in the style of studio ghibli animation +232, +233,"""The hunger artist, full"" by Ryan Murdock" +234, +235,Everywhere is no-place +236,"A portrait: man, whose lineage is corpse." +237,Last Breath +238,A propaganda poster promoting big chungus +239,зеленая собака +240,a beautiful person +241,Memento Mori +242,A propaganda poster promoting big chungus +243,is this loss? +244,a tree with weaping branches +245,Nostos +246,Beauty here -- a photograph by Ryan Murdock +247,a tiny church inside an eyeball +248, +249,a cherry tree made of fractals +250,"joy, happiness, bliss" +251,The First Supper +252,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +253,🔴~__��'t � +254,Dancing in the moonlight +255,Mona Lisa +256,"God, it's amazing." +257,a man holding an apple in one hand +258,Some stolen Gods take up the reigns of darkness. 
+259,🔴~__��'t � +260,Figure 5: a corgi +261,a photo from {my hometown} +262,Anxiety: the one emotion that does not lie +263,In the temple of God +264, +265,Metaphysics +266,a beautiful woman +267,a beautiful woman +268,a surrealist eye +269,the massive hope nof early iterations +270,Ophelia +271,a minimalist painting that you wouldn't understand +272,Aflame +273,a christmas card from the victorian era +274,Dancing in the moonlight +275,/ +276,"Mephisto, shrouded in smoke" +277,a beautiful woman +278,зеленая собака +279,Easter cat +280,The Oracle leans forward to say: beware the ides of March +281,a portrait of a beautiful person +282,Persephone +283,a portrait of Abraham Lincoln +284,the moon is a sickle cell +285,symmetry +286,Monet Lisa +287,Saturn being a good dad to his son +288,The Monet Lisa +289,I sold my soul at the crossroads +290,a beautiful person +291,A poster advertising Freudian Psychoanalytics +292,Cat in a teacup +293,a silent palace +294, +295,a beautiful person +296, +297, +298,Super Mario World but every character is Luigi +299,chiaroscuro +300,A dead man +301,pasta ömetabolism +302,A vanitas still life that features twitter follower counts +303,slightly mild cosplaying pseudo beard +304,Monet Lisa +305,Mona Lisa +306,handsome commemorative garden pigeon +307,pasta ömetabolism +308,"""The hunger artist, full"" by Ryan Murdock" +309,a gorgeous bouquet with roses and sunflowers +310,is this loss? but it's van gogh +311,Memorial +312,a forest filled with moonlight +313,Post-Modern Nouveaux Statue +314,she sings opera +315,"God closes a door, boards up stained-glass windows." +316,a dog wearing a suit playing tennis +317,Intimations of Immortality +318, +319,turnt brony undergrad dwight +320,a famous painted portrait of Lady Macbeth +321,a cherry tree made of fractals +322,Weeping Roses +323,pasta ömetabolism +324, +325, +326,"A portrait: man, whose lineage is corpse." +327,The average Advadnoun twitter follower +328,the moon is a sickle cell +329,A black and white photo of a rainbow. +330,God killed Van Gogh. +331,turnt brony undergrad dwight +332,"a brilliant sketch titled ""Let Forever be Delayed""" +333,handsome commemorative garden pigeon +334,a painting of a sycamore in +335,a professional photo of a cat wearing a party hat +336,Persephone +337,Taylor Swift +338,Homer Simpson +339,using generated paint +340,A black and white photo of a rainbow. +341,meaningless neko ♡♡ neko +342,is this loss? but it's van gogh +343,Is this loss? +344,a man from an anime +345,the massive hope nof early iterations +346,a beautiful woman +347,Post-Modern Nouveaux Statue +348,photosynthesis +349,a cherry tree made of fractals +350,a minimalist painting that you wouldn't understand +351,a corgi +352,handsome commemorative garden pigeon +353,The OLD DATA +354,cowboy with a trumpet +355,A short life full of immense joy +356,a beautiful woman +357,The end of nothing is eroding. A watercolor by K. +358,a tasteful nude +359,symmetry +360,a portrait of Abraham Lincoln +361,Last Breath +362,the eternal dread of lemongrab +363,vangogh # landscape +364,a cherry tree made of fractals +365,The Devil Whispers blood +366,a silent palace +367,Paradise Lost +368,Monet Lisa +369,Everywhere is no-place +370,Taylor Swift +371,"r.j. Murdock's ""The Death of a Hacker""" +372,a portrait of Abraham Lincoln +373,I know the end +374,Persephone +375,A poster advertising Freudian Psychoanalytics +376,a beautiful woman +377,A black and white photo of a rainbow. 
+378,the whitest man +379,the eternal dread of lemongrab +380,a drawing by an AI +381,🔴~__��'t � +382,We haunt the synapses +383,frogs in the style of Ralph Steadman +384,a beautiful haunting +385,photosynthesis +386,a character from a ghibli movie +387,A structure made of people standing on top of other people +388,Intimations of Immortality +389,a jukebox powered by smoke +390,beautiful art +391,In the temple of God +392,Intimations of Immortality +393,a beautiful painting +394,A gun killed Van Gogh. +395,a man with no eyes +396,a famous painted portrait of Lady Macbeth +397,a tasteful nude +398,a jukebox powered by smoke +399,a portrait of Juliet +400,The Patron Saint of Evil +401,a beautiful Waluigi +402,a gilded lily +403, +404,Kierkegaard on the edge +405,a beautiful person +406,Just west of Alpha Centauri +407,a horse with four eyes. +408,Good grief +409,a portrait of a beautiful person +410,Aflame +411,a man wearing makeup +412,a portrait of Abraham Lincoln +413,a corgi +414,I do not think they'll sing for me +415,Intimations of Immortality +416,A poster serving as a memento mori +417,Psychology +418,A gun killed Van Gogh. +419,"a brilliant sketch titled ""Let Forever be Delayed""" +420,using generated paint +421,pasta ömetabolism +422,a summer day +423,a gilded lily +424,a cute cat +425,on the edge of grace +426,Art is growing. +427,Spiderman delivering a pizza +428,the intersection of art and technology +429,"""The hunger artist, full"" by Ryan Murdock" +430,a tarot card +431,an omen +432,slightly mild cosplaying pseudo beard +433,meaningless neko ♡♡ neko +434,intricate nothing +435,symmetry +436,I have no idea what anything in this image is +437,a photo from {my hometown} +438,a sad man +439,face like an M.C. Escher drawing n(you could get lost in its beauty) +440,A E S T H E T I C ? +441,totemic dusk +442,Nostos +443,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +444,a silent palace +445,a beautiful painting +446,"half Ryan, half pigeon" +447,Weeping Roses +448,a broken heart +449,a portrait of Juliet +450,a painting of the last day +451,"a brilliant sketch titled ""Let Forever be Delayed""" +452,a beautiful person +453,"""The hunger artist, full"" by Ryan Murdock" +454,a cosmic entity alien with four eyes. +455,a photo of a purple dog +456,a summoning +457,Redacted ████████ +458,a ginormous baby +459,On the edge of endless darkness +460,The Fates knit such delicate nooses for us to bind +461,Theotokos of Milk +462,A minimalistic still life of a cat sitting on a table +463,Dancing in the moonlight +464,a minimalist painting that you wouldn't understand +465,a beautiful woman +466,totemic dusk +467,"Ryan Murdock's ""God haunts the suburbs""" +468,Dancing in the moonlight +469,a beautiful woman +470,a city in Van Gogh's style +471,"""The hunger artist, full"" by Ryan Murdock" +472,a person's face +473,a portrait of +474,Dancing in the moonlight +475,a portrait of Persephone +476,a minimalist painting that you wouldn't understand +477,a portrait of Abraham Lincoln +478,Synesthesia +479,a cute corgi +480,a portrait of advadnoun +481,a green doG +482,a man with no eyes +483,a cherry tree made of fractals +484,a ginormous baby +485, +486,turnt brony undergrad dwight +487,"God, it's amazing." +488,"""The hunger artist, full"" by Ryan Murdock" +489,We haunt the synapses +490,God's Eyes are Wired Shut +491,a famous painted portrait of Lady Macbeth +492,Juliet +493,a character from a ghibli movie +494,the whitest man +495,a horse with four eyes. 
+496,a photo of a purple dog +497,a beautiful person +498,The Patron Saint of Hackers +499,Dead Codes by Ryan Murdock +500,something trite +501,beautiful art +502, +503,the monet lisa +504,a cute cat +505,👉 👈 +506,A propaganda poster promoting big chungus +507,a beautiful person +508,a portrait of advadnoun +509,a cherry tree made of fractals +510,"It's a meme, I guess" +511,a person's face +512,A baroque portrait of Hamlet +513,a city in Van Gogh's style +514,"""The hunger artist, full"" by Ryan Murdock" +515,a man with no eyes +516,a minimalist painting that you wouldn't understand +517,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +518,"joy, happiness, bliss" +519, +520,"a brilliant sketch titled ""Let Forever be Delayed""" +521,Last Breath +522,On the edge of endless darkness +523,a photo of Juliet +524,Summer's Symphony: Counterpoint and Melody +525,Persephone +526,a green doG +527,symmetry +528,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +529,The Starry Night +530,Genesis +531,bootleg edgy casual assange +532,Memento Mori +533,meaningless neko ♡♡ neko +534,totemic dusk +535,Aflame +536,"""Here lies Ryan Murdock"" -- a memorial with the date and cause of departure." +537,"""The hunger artist, full"" by Ryan Murdock" +538,f*** you +539,a tree with leaves that are amarillo sightseeing thetic +540,a painting of the last day +541,"God, it's amazing." +542,Paradise Lost +543,a gilded lily +544,Aflame +545,a portrait of +546,a painting that couldn't be sold +547,a man holding an apple in one hand +548,"A clock with gorgeous, intricate gradients on it" +549,a goblin by van gogh +550,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +551,A vanitas still life that features twitter follower counts +552,the whitest man +553,"""The hunger artist, full"" by Ryan Murdock" +554,is this loss? but it's van gogh +555,Synesthesia +556,Aflame +557,a cherry tree made of fractals +558,A propaganda poster for daring to eat a peach. +559,A vanitas still life that features twitter follower counts +560,the moon is a sickle cell +561,The Lost Generation +562,the eternal dread of lemongrab +563,The First Supper +564,a character from a ghibli movie +565,a man on fire +566,symmetry +567,pasta ömetabolism +568,a horse with four eyes. +569,Metaphysics +570,Synesthesia +571,The Fates knit such delicate nooses for us to bind +572,Knowledge of Good and Evil +573,meaningless neko ♡♡ neko +574,A Tragedy +575, +576,a drawing by an AI +577,The Fool tarot card but it's The Lovers +578,a beautiful person +579,a silent palace +580,an omen +581,"A portrait: man, whose lineage is corpse." +582,Dancing in the moonlight +583,a gilded lily +584,turnt brony undergrad dwight +585,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +586,totemic dusk +587,Monet Lisa +588,fatal skull prose visits bend ntuscan painting underthecomprehend +589,Monet Lisa +590,Aflame +591,an intricate painting Of Eternity by Ryan Murdock +592,"Intricate, Weeping Tree by Ryan Murdock" +593,Summer's Symphony: Counterpoint and Melody +594,Monet Lisa +595,Last Breath +596,is this loss? but it's van gogh +597,"half Ryan, half pigeon" +598,"God closes a door, boards up the stained-glass windows. nnGod hides." +599,Everything was beautiful and nothing hurt +600,"r.j. Murdock's ""The Death of a Hacker""" +601,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +602,meaningless neko ♡♡ neko +603,twilight +604,the sun is shining on the lake +605,a portrait of a beautiful person +606,the sun is shining on the lake +607, +608,a portrait of Abe Lincoln +609,A gun killed Van Gogh. +610,a photo from {my hometown} +611,The Fool tarot card but it's The Lovers +612,A structure made of people standing on top of other people +613,"God closes a door, boards up the stained-glass windows. nnGod hides." +614,an old man +615,a beautiful waluigi +616,is this loss? but it's van gogh +617,a man standing alone in a wheat field +618,Aflame +619,Synesthesia +620, +621,Intimations of Immortality +622,The First Supper +623,"God, it's amazing." +624,Persephone +625,"r.j. Murdock's ""The Death of a Hacker""" +626,God's Eyes are Wired Shut +627,Do you remember the mythic beast?nA last-minute cancellation at The Last Supper +628,f*** it market standard rule language – distinguish np tax science research +629,totemic dusk +630,Cat in a teacup +631,frogs in the style of Ralph Steadman +632,a beautiful person +633,The Starry Night +634,Juliet +635,turnt brony undergrad dwight +636, +637,There is something so interesting about a bleeding edge full of dust. +638,On the edge of endless darkness +639,The warrior Achilles devours slain Hector's corpse -- an ink poster by Ryan Murdock +640,turnt brony undergrad dwight +641,Intimations of Immortality +642,a portrait of Abraham Lincoln +643,a man wearing makeup +644,a sketch of the mind of god +645,a man on fire +646,a portrait of Abraham Lincoln +647, +648,The ancient Θωερτυ keyboard of brave Achilles +649,goes thu extre— dum dum dizzy grimstupiddic ious mindidioirony merely experiment . +650,"A group portrait featuring the id, ego, and superego" +651,a photo from {my hometown} +652,A structure made of people standing on top of other people +653,a famous painted portrait of Lady Macbeth +654,ogden +655,pasta ömetabolism +656,a tree with weaping branches +657,photosynthesis +658,handsome commemorative garden pigeon +659,a photo of a purple dog +660,"a brilliant sketch titled ""Let Forever be Delayed""" +661,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +662,The Death of Achilles +663,potus mormon lincoln rooster +664,A black and white photo of a rainbow. +665,a beautiful haunting +666,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +667,In the temple of God +668,a beautiful person +669,The Patron Saint of Mathematics +670,a brilliant painting titled +671,a gilded lily +672,a tiny church inside an eyeball +673,a portrait of Juliet +674,A painting that sold for a million dollars +675,the moon is a sickle cell +676,photosynthesis +677,The Theotokos is a bird +678,the whitest man +679,The Monet Lisa +680,Beauty here -- a photograph by Ryan Murdock +681,Breathe deep the fumes at Delphi +682,the sun is shining on the lake +683,photosynthesis +684,The things I'll take with me +685,a green doG +686,a beautiful person +687,The years gild our memoriesnUnfairly. +688,The Lost Generation +689,a beautiful person +690,The average Advadnoun twitter follower +691,a goblin by van gogh +692,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +693,"A professional, minimalist poster for the book The Old Man and the Sea" +694, +695,Cat in a teacup +696,a beautiful person +697,beautiful art +698,I sold my soul at the crossroads +699,face like an M.C. 
Escher drawing n(you could get lost in its beauty) +700,a gorgeous bouquet with roses and sunflowers +701,a portrait of Abraham Lincoln +702,Sisyphus +703,a cute cat +704,a portrait of +705,a minimalist painting that you wouldn't understand +706,a photo of Bernie Sanders sitting on a chair and wearing mittens +707,a woman and a crow +708,a character from a ghibli movie +709,a photo of a purple dog +710,a dog eating a cheese burger +711,Last Breath +712,a sketch of the mind of god +713,a steampunk technomancer +714,We haunt the synapses +715,using generated paint +716,a cherry tree made of fractals +717,Saturn being a good dad to his son +718,oof deeplearning corgi corgi rendering +719, +720,Dancing in the moonlight +721,A Tragedy +722,A propaganda poster promoting big chungus +723,A structure made of people standing on top of other people +724,"A cute, minmimalist valentine's day card featuring a cat" +725,a cute cat +726,The skyscraper draws blood -- a landscape +727,the monet lisa +728,a photo of a person generating a painting of a person with AI +729,"""A God Made of Wires and Dust"" by Ryan Murdock" +730,Monet Lisa +731,photosynthesis +732,Hunger art by r.j. Murdock +733,"""The hunger artist, full"" by Ryan Murdock" +734,An Arundel Tomb +735,twilight +736,"r.j. Murdock's ""The Death of a Hacker""" +737,living in a den of thieves +738,"""A new hope blooms on the long notes of old horns.""" +739,"The laptop of brave Achaean Achilles, who would not live long." +740,a minimalist painting that you wouldn't understand +741,"Intricate, Weeping Tree by Ryan Murdock" +742,The Fool +743,a summoning +744,pasta ömetabolism +745,"a brilliant sketch titled ""Let Forever be Delayed""" +746,a silent palace +747,The average Advadnoun twitter follower +748,f*** it market standard rule language – distinguish np tax science research +749,Monet Lisa +750,"a brilliant sketch titled ""Let Forever be Delayed""" +751,meaningless neko ♡♡ neko +752,"God, it's amazing." +753,Nostos +754,Shinji Ikari +755,a beautiful woman +756,The Starry Night +757,hamont parkland avenue incumbscreenshotsaturday hemisphere footage algorithm +758,a beautiful woman +759, +760,Summer always ending +761,president abe lincoln but a cat +762,🎷 +763,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +764,a cherry tree made of fractals +765,A painting that sold for one billion dollars +766,a man standing alone in a wheat field +767,symmetry +768,a broken heart +769,a silent palace +770,A vanitas still life that features twitter follower counts +771,"half Ryan, half pigeon" +772,"a brilliant sketch titled ""Let Forever be Delayed""" +773,slightly mild cosplaying pseudo beard +774,a portrait of +775,God's Eyes are Wired Shut +776,she sings opera +777,a person's face +778,a cherry tree made of fractals +779,Dead Codes by Ryan Murdock +780,Everywhere is no-place +781,The First Supper +782,Monet Lisa +783,A short life full of immense joy +784,Anxiety: the one emotion that does not lie +785,Anxiety: the one emotion that does not lie +786,symmetry +787,a beautiful waluigi +788,a goblin by van gogh +789,"""A new hope blooms on the long notes of old horns.""" +790,Juliet +791,The OLD DATA +792,a beautiful woman +793,The average Advadnoun twitter follower +794,Synesthesia by Ryan Murdock +795,Persephone flees Hades +796,Last Breath +797,a portrait of Persephone +798,Homer Simpson +799,totemic dusk +800,a steampunk technomancer +801,a portrait of Abraham Lincoln +802,a cherry tree made of fractals +803,bored of dying +804,a famous painted portrait of Lady Macbeth +805,a summer day +806,A E S T H E T I C ? +807,A vanitas still life that features twitter follower counts +808,an illustration of a baby daikon radish in a tutu walking a dog +809,Persephone +810,pasta ömetabolism +811,A vision of the Theotokos in my glass of coffee +812,a dog. +813,a photo of a person generating a painting of a person with AI +814,🔴~__��'t � +815,Intimations of Immortality +816,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +817,A dead man +818,The Oracle leans forward to say: beware the ides of March +819,Monet Lisa +820,a silent palace +821,an intricate painting of eternity +822,A propaganda poster for chunky cats. +823,God killed Van Gogh. +824,the eyes of God are wired shut +825,Persephone +826,symmetry +827,Mona Lisa +828,Saturn being a good dad to his son +829,a technomancer +830, +831,a cherry tree made of fractals +832,A cat wearing a tophat +833,frogs in the style of Ralph Steadman +834,a portrait of a beautiful person +835,a green dog +836,a portrait of Abraham Lincoln +837,Hungry Dogs Will Devour in the Daytime +838,a photo of a purple dog +839,Cat in a teacup +840, +841,Nostos +842,A baroque portrait of Hamlet +843,Saturn being a good dad to his son +844,Hell is Paradise +845,a tasteful nude +846,"God, it's amazing." +847,Everywhere is no-place +848,a minimalist painting that you wouldn't understand +849,a tree with weaping branches +850,a portrait of Elvis Presley +851,a man standing alone in a wheat field +852,Juliet +853,I sold my soul at the crossroads +854,a beautiful person +855,photosynthesis +856, +857,"Mephisto, shrouded in smoke" +858,playing Go with Death +859,a painting of the last day +860,totemic dusk +861,Hell is Paradise +862,a christmas card from the victorian era +863,Good grief +864,handsome commemorative garden pigeon +865,a portrait of +866,a portrait of Abraham Lincoln +867,she came in through the wall +868,a sad man +869,In the temple of God +870,fuzzy pals hum +871,a painting of a sycamore in +872,a beautiful waluigi +873,"a brilliant sketch titled ""Let Forever be Delayed""" +874,a portrait of a beautiful person +875,a portrait of Juliet +876,MEMETIC HAZARD +877,The years gild our memoriesnUnfairly. 
+878,Mona Lisa +879,pasta ömetabolism +880,pasta ömetabolism +881,bored of dying +882,Cat in a teacup +883,a cherry tree made of fractals +884,an intricate drawing of eternity +885,mammals +886,a portrait of Persephone +887,treehouse in the style of studio ghibli animation +888,watching TV in purgatory +889,The winds of change by Ryan Murdock +890,a technomancer +891,a portrait of Persephone +892,Last Breath +893,A minimalistic still life of a cat sitting on a table +894, +895,cult of prisms +896,Aflame +897,Cat in a teacup +898,"God, it's amazing." +899,a minimalist painting that you wouldn't understand +900,a woman and a crow +901,totemic dusk +902,a city in Van Gogh's style +903,A baroque portrait of Hamlet +904,murdoch +905,a silent palace +906,Anxiety: the one emotion that does not lie +907,a photo of a purple dog +908,the moon is a sickle cell +909,Tendrils of smoke curl around the caterpillar with a hookah +910,president abe lincoln but a cat +911,a beautiful woman +912,handsome commemorative garden pigeon +913,an intricate painting of eternity +914,"God, it's amazing." +915,Grippy socks; no drawstrings: high fashion +916,The average Advadnoun twitter follower +917,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +918,a photo from {my hometown} +919,MEMETIC HAZARD +920,a portrait of Elvis Presley +921,a woman and a crow +922,Saturn being a good dad to his son +923,beautiful art +924,Shinji Ikari +925,a portrait of +926,a photo of a purple dog +927,Ophelia +928,a dog wearing a suit playing tennis +929,We haunt the synapses +930,I do not think they'll sing for me +931,Genesis +932,a beautiful person +933,"a brilliant sketch titled ""Let Forever be Delayed""" +934,Metaphysics +935,bored of dying +936,treehouse in the style of studio ghibli animation +937, +938,photosynthesis +939,A structure made of people standing on top of other people +940,meaningless neko ♡♡ neko +941,a photo of the sun melting into the ocean +942,symmetry +943,the moon is a sickle cell +944,Dancing in the moonlight +945,Last Breath +946,I sold my soul at the crossroads +947,a beautiful woman +948,"God, it's amazing." +949,Cat in a teacup +950,a tree with weaping branches +951,"God, it's amazing." +952,Cat in a teacup +953,"r.j. Murdock's ""The Death of a Hacker""" +954,using generated paint +955,fuzzy pals hum +956,"A portrait: man, whose lineage is corpse." +957,a ginormous baby +958,a beautiful woman +959,"half Ryan, half pigeon" +960,when the wind blows +961,a beautiful woman +962,pasta ömetabolism +963,a cherry tree made of fractals +964,The Monet Lisa +965,"""The hunger artist, full"" by Ryan Murdock" +966,a portrait of advadnoun +967,The Fool tarot card but it's The Lovers +968,Persephone +969,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +970,an omen +971,the eternal dread of lemongrab +972,a man on fire +973,Aflame +974,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! The agony!""" +975,twilight +976,hamont parkland avenue incumbscreenshotsaturday hemisphere footage algorithm +977,a silent palace +978,a selfie +979,the moon is a sickle cell +980,a portrait of Abraham Lincoln +981,a tree with weaping branches +982,a tiny church inside an eyeball +983,a portrait of a beautiful person +984,Paradise Lost +985,a horse with four eyes. +986,president abe lincoln but a cat +987,a summer day +988,Anxiety: the one emotion that does not lie +989,Saturn being a good dad to his son +990,In the temple of God +991,Redacted ████████ +992,Dr. 
Faustus and Mephisto +993,a minimalist painting that you wouldn't understand +994,a man standing alone in a wheat field +995,a seance in the basement +996,a portrait of +997,Aflame +998,the moon is a sickle cell +999,beautiful art +1000,a man on fire +1001,a tiny church inside an eyeball +1002,totemic dusk +1003,Persephone +1004,piss indiefilm +1005,a beautiful woman +1006,The EcoCathedral +1007,"joy, happiness, bliss" +1008,Intimations of Immortality +1009,the whitest man +1010,a silent palace +1011, +1012,a woman and a crow +1013,Memento Mori +1014,Visions in envy of the gods +1015,symmetry +1016,A poster advertising Freudian Psychoanalytics +1017,A propaganda poster promoting big chungus +1018,With the Gods in envy of their visions +1019,a cherry tree made of fractals +1020,pasta ömetabolism +1021,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1022,a beautiful person +1023,cowboy with a trumpet +1024,a portrait of a beautiful person +1025,The OLD DATA +1026,f*** it market standard rule language – distinguish np tax science research +1027,murdoch +1028,Some stolen Gods take up the reigns of darkness. +1029,a portrait of Juliet +1030,a tasteful nude +1031,she sings opera +1032,The First Supper +1033,handsome commemorative garden pigeon +1034,cult of prisms +1035,Cat in a teacup +1036,💨 👻 ☺ 🔮 🔺 ✊ +1037,a portrait of Abraham Lincoln +1038,a corgi +1039,a beautiful woman +1040,a portrait of a beautiful person +1041,Dead Codes by Ryan Murdock +1042,totemic dusk +1043,Juliet +1044,a portrait of Elvis Presley +1045,a criminal +1046,Genesis where the universe was made +1047,a portrait of +1048,turnt brony undergrad dwight +1049,Cat in a teacup +1050,a corgi +1051,"Hamlet saying ""To be or not to be""" +1052,a portrait of a beautiful person +1053,A E S T H E T I C ? +1054,Figure 5: a corgi +1055,A gun killed Van Gogh. +1056,Persephone flees Hades +1057,a silent palace +1058,pasta ömetabolism +1059,a beautiful person +1060,on the edge of grace +1061,a portrait of Elvis Presley +1062,Persephone +1063,Tendrils of smoke curl around the caterpillar with a hookah +1064,"half Ryan, half pigeon" +1065,a sunflower +1066,a beautiful person +1067,a portrait of Juliet +1068,A dead man +1069,a character from a ghibli movie +1070,a silent palace +1071,a portrait of Elvis Presley +1072,a portrait of advadnoun +1073,A E S T H E T I C ? +1074,зеленая собака +1075,A baroque portrait of Hamlet +1076,a man at the beach +1077,Sisyphus +1078,Good grief +1079,"r.j. 
Murdock's ""The Death of a Hacker""" +1080,a beautiful woman +1081,🔴~__��'t � +1082,a portrait of advadnoun +1083,a painting of a sycamore in +1084,president abe lincoln but a cat +1085,The agony of time +1086,God once loved a woman +1087,pasta ömetabolism +1088,Dead Codes by Ryan Murdock +1089, +1090,slightly mild cosplaying pseudo beard +1091,Last Breath +1092,The Oracle leans forward to say: beware the ides of March +1093,The Devil Wears Khakis +1094,"""The hunger artist, full"" by Ryan Murdock" +1095,In the temple of God +1096,a beautiful person +1097,a man from an anime +1098,She's gorgeous +1099,A vanitas still life that features twitter follower counts +1100, +1101,the eternal dread of lemongrab +1102,Advadnoun +1103,a summer day +1104,The Fool tarot card but it's The Lovers +1105,I miss the Spring +1106,an illustration of a baby daikon radish in a tutu walking a dog +1107,The Oracle leans forward to say: beware the ides of March +1108,Contentment at the Disco +1109,The First Supper +1110,Saturn being a good dad to his son +1111,a beautiful woman +1112,"Intricate, Weeping Tree by Ryan Murdock" +1113,"a brilliant sketch titled ""Let Forever be Delayed""" +1114,beautiful art +1115, +1116,a silent palace +1117,a portrait of Juliet +1118,A propaganda poster promoting big chungus +1119,a portrait of a beautiful person +1120,a portrait of Abraham Lincoln +1121, +1122,the whitest man +1123,a portrait of Abe Lincoln +1124,Monet Lisa +1125,The Fool tarot card but it's The Lovers +1126,a portrait of +1127,a portrait of Elvis Presley +1128,Post-Modern Nouveaux Statue +1129,a cherry tree made of fractals +1130,f*** it market standard rule language – distinguish np tax science research +1131,symmetry +1132,pasta ömetabolism +1133,a brilliant painting titled +1134,The First Supper +1135,a corgi +1136,a beautiful person +1137,a green doG +1138,The OLD DATA +1139,Ophelia +1140,a portrait of Abraham Lincoln +1141,incineratures motherhood +1142,a green dog +1143,a portrait of advadnoun +1144,a sunflower +1145, +1146,a man from an anime +1147,Beauty here -- a photograph by Ryan Murdock +1148,slightly mild cosplaying pseudo beard +1149,Nostos +1150,pasta ömetabolism +1151,a beautiful person +1152,"half Ryan, half pigeon" +1153,turnt brony undergrad dwight +1154,beautiful art +1155,a portrait of Persephone +1156,A sticky-note magnum opus featuring birds +1157,I sold my soul at the crossroads +1158,"a brilliant sketch titled ""Let Forever be Delayed""" +1159,A poster advertising Freudian Psychoanalytics +1160,using generated paint +1161,The OLD DATA +1162,a horse with four eyes. +1163,is this loss? but it's van gogh +1164,a gorgeous bouquet with roses and sunflowers +1165,Anxiety: the one emotion that does not lie +1166,turnt brony undergrad dwight +1167,The Lost Generation +1168,Taylor Swift +1169,The Lost Generation +1170,a photo from {my hometown} +1171,The OLD DATA +1172,a portrait of +1173,a cherry tree made of fractals +1174,an intricate sculpture of Death itself +1175, +1176,зеленая собака +1177,a sunflower +1178,angst +1179,president abe lincoln but a cat +1180,a beautiful person +1181,The OLD DATA +1182,"You shake the demons hand, and redo it all, again." +1183,the latent space +1184,Fire +1185,a tree with weaping branches +1186,treehouse in the style of studio ghibli animation +1187,Good grief +1188,a portrait of +1189,a wholesome clown. Not creepy at all +1190,Theotokos of Milk +1191,"God closes a door, boards up the stained-glass windows. nnGod hides." 
+1192,I sold my soul at the crossroads +1193,"Mephisto, shrouded in smoke" +1194,A baroque portrait of Hamlet +1195,a lamp +1196,MEMETIC HAZARD +1197,"""Your mind falls in the gaps"" - by Ryan Murdock" +1198,cowboy with a trumpet +1199,Aflame +1200,A vanitas still life that features twitter follower counts +1201,a beautiful person +1202,Synesthesia +1203,Is this loss? +1204,Adverb working on Photoshop Neural Filters | Behance Art +1205,Everything was beautiful and nothing hurt +1206,Mona Lisa +1207,A structure made of people standing on top of other people +1208,"Intricate, Weeping Tree by Ryan Murdock" +1209,the whitest man +1210,The Fates knit such delicate nooses for us to bind +1211,a tree with weaping branches +1212,a beautiful person +1213,Nostos +1214,Post-Modern Nouveaux Statue +1215,Genesis +1216,totemic dusk +1217,a dog. +1218,photosynthesis +1219,The average Advadnoun twitter follower +1220,"""The hunger artist, full"" by Ryan Murdock" +1221,a person's face +1222,slightly mild cosplaying pseudo beard +1223,a jukebox powered by smoke +1224,Monet Lisa +1225,Intimations of Immortality +1226,a gorgeous bouquet with roses and sunflowers +1227,face like an M.C. Escher drawing n(you could get lost in its beauty) +1228,a photo of a purple dog +1229,a tiny church inside an eyeball +1230,Good grief +1231,Last Breath +1232,a beautiful waluigi +1233,the moon is a sickle cell +1234,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1235,I sold my soul at the crossroads +1236,Persephone +1237,a portrait of Abraham Lincoln +1238,a beautiful painting +1239,Last Breath +1240,a man on fire +1241,"a brilliant sketch titled ""Let Forever be Delayed""" +1242,A gun killed Van Gogh. +1243,a sketch of the mind of god +1244,Intimations of Immortality +1245,Intimations of Immortality +1246,turnt brony undergrad dwight +1247,A sticky-note magnum opus featuring birds +1248,Aflame +1249,Grippy socks; no drawstrings: high fashion +1250,👉 👈 +1251,Shrek the ogre +1252,a beautiful woman +1253,a portrait of Elvis Presley +1254,president abe lincoln but a cat +1255,Post-antiquity art +1256,using generated paint +1257,a dog eating a cheese burger +1258,The average Advadnoun twitter follower +1259,Monet Lisa +1260,"A professional, minimalist poster for the book The Old Man and the Sea" +1261,We haunt the synapses +1262,Post-Modern Nouveaux Statue +1263,a picture of Ryan Murdock +1264,cowboy with a trumpet +1265,colorful rabbits chandelier polaroid +1266,a character from a ghibli movie +1267,a goblin by van gogh +1268,a beautiful painting +1269,a photo of a purple dog +1270,a portrait of Persephone +1271,"Hamlet saying ""To be or not to be""" +1272,Homer Simpson +1273,a cute cat +1274,turnt brony undergrad dwight +1275,Intimations of Immortality +1276,a man wearing makeup +1277,They called you the hyacinth girl +1278,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1279,Cat in a teacup +1280,Juliet +1281,"""The wages of sin are generous"" by Ryan Murdock" +1282,"Pig, neither dead nor alive, stare into the heart of light, the silence." +1283, +1284,a horse with four eyes. +1285,Advadnoun +1286,Last Breath +1287,totemic dusk +1288,The OLD DATA +1289,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +1290,a man holding an apple in one hand +1291,a beautiful woman +1292,melancholia +1293,Shinji Ikari +1294,a gorgeous bouquet with roses and sunflowers +1295,a portrait of advadnoun +1296,a tasteful nude +1297,Genesis +1298,In smoke and mould the fleshless dead +1299,The average Advadnoun twitter follower +1300,a cute cat +1301,a painting of a sycamore in +1302,a woman and a crow +1303,Persephone +1304, +1305,using generated paint +1306,"A cute, minmimalist valentine's day card featuring a cat" +1307,a painting that couldn't be sold +1308,bored of dying +1309,pasta ömetabolism +1310,Dancing in the moonlight +1311,a beautiful woman +1312,Dr. Faustus and Mephisto +1313,"joy, happiness, bliss" +1314,a photo from {my hometown} +1315,a wholesome clown. Not creepy at all +1316,a portrait of Elvis Presley +1317,a cherry tree made of fractals +1318,a man standing alone in a wheat field +1319,Dancing in the moonlight +1320,Hunger art by Ryan Murdock +1321,a beautiful waluigi +1322,A black and white photo of a rainbow. +1323,totemic dusk +1324,a beautiful person +1325, +1326,a beautiful woman +1327,a horse with four eyes. +1328,The Lost Generation +1329,Death is a black camel that kneels down so we can ride +1330,a ginormous baby +1331,Dancing in the moonlight +1332,an old man +1333,a horse with four eyes. +1334,a photo of a purple dog +1335,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1336,a silent palace +1337,The OLD DATA +1338,a tree with weaping branches +1339,Creativity is only composition in disguise. +1340,"r.j. Murdock's ""The Death of a Hacker""" +1341,Persephone +1342,president abe lincoln but a cat +1343,There is something so interesting about a bleeding edge full of dust. +1344,A poster advertising death by water +1345,Persephone +1346,Saturn being a good dad to his son +1347,is this loss? but it's van gogh +1348,Monet Lisa +1349,fuzzy pals hum +1350,"""The hunger artist, full"" by Ryan Murdock" +1351,Shinji Ikari +1352,a beautiful woman +1353,"Son of man,nYou cannot say, or guess, for you know onlynA heap of broken images" +1354,God once loved a woman +1355,a horse with four eyes. +1356,a cherry tree made of fractals +1357,a beautiful haunting +1358,I miss the Spring +1359,gradient +1360,a wormhole +1361,a beautiful woman +1362,president abe lincoln but a cat +1363,handsome commemorative garden pigeon +1364,Everywhere is no-place +1365,"""It is beginning to end.""nby Ryan Murdock." +1366,she sings opera +1367,a jukebox powered by smoke +1368,a portrait of Juliet +1369,playing Go with Death +1370,a man standing alone in a wheat field +1371,Dead Codes by Ryan Murdock +1372,Synesthesia +1373,The years gild our memoriesnUnfairly. +1374,A propaganda poster promoting big chungus +1375,"God, it's amazing." 
+1376,Persephone +1377,a beautiful person +1378,MEMETIC HAZARD +1379,totemic dusk +1380,Intimations of Immortality +1381,A poster advertising death by water +1382,a photo of a purple dog +1383,symmetry +1384,A poster advertising misery +1385,a portrait of Elvis Presley +1386,Post-Modern Nouveaux Statue +1387,a man from an anime +1388,Anxiety: the one emotion that does not lie +1389,photosynthesis +1390,the man in the mirror +1391,"half Ryan, half pigeon" +1392,Sorrow's my body on the wavesnnAlone on the water +1393,a seance in the basement +1394,A poster serving as a memento mori +1395,Aflame +1396,A structure made of people standing on top of other people +1397,The First Supper +1398,totemic dusk +1399,a beautiful person +1400,a painting of the last day +1401,a photo of Juliet +1402,a horse with four eyes +1403,pasta ömetabolism +1404,Synesthesia +1405,a cherry tree made of fractals +1406,Post-post-post-post-modern art +1407,pasta ömetabolism +1408,MEMETIC HAZARD +1409,a portrait of Abe Lincoln +1410,Everywhere is no-place +1411,Memento Mori +1412,The average Advadnoun twitter follower +1413,a beautiful painting +1414,A black and white photo of a rainbow. +1415,The Death of Achilles +1416,a portrait of +1417,cult of prisms +1418,a beautiful person +1419,a beautiful painting +1420,a beautiful woman +1421,An Arundel Tomb +1422,she came in through the wall +1423,the moon is a sickle cell +1424,a minimalist painting that you wouldn't understand +1425,a tasteful nude +1426,a gilded lily +1427,a beautiful woman +1428,a brilliant painting titled +1429,a painting of the city +1430,"""Your mind falls in the gaps"" - by Ryan Murdock" +1431,"r.j. Murdock's ""The Death of a Hacker""" +1432,Aflame +1433,a beautiful painting +1434,Juliet +1435,turnt brony undergrad dwight +1436,symmetry +1437,Going home -- melanchonostalgic photography +1438,a character from a ghibli movie +1439,She's gorgeous +1440,incineratures motherhood +1441,a calm still life in ethereal blue +1442,incineratures motherhood +1443,A baroque portrait of Hamlet +1444,"A professional, minimalist poster for the book The Old Man and the Sea" +1445,Anxiety: the one emotion that does not lie +1446,a portrait of a beautiful person +1447,"Go off to sleep in the sunshine, I don’t want to see the day when it’s dying" +1448,a tree with weaping branches +1449,a tasteful nude +1450,Intimations of Immortality +1451,Weeping Roses +1452,playing Go with Death +1453,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1454,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1455,turnt brony undergrad dwight +1456,Dancing in the moonlight +1457,Figure 5: a corgi +1458,a beautiful woman +1459,A Tragedy +1460,a photo of a purple dog +1461,a famous painted portrait of Lady Macbeth +1462,"A cute, minmimalist valentine's day card featuring a cat" +1463,The things I'll take with me +1464,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1465,Summer's Symphony: Counterpoint and Melody +1466,a horse with four eyes +1467,Aflame +1468,a ginormous baby +1469, +1470,Saturn being a good dad to his son +1471,a beautiful woman +1472,a terrifying night hag +1473,a portrait of Abraham Lincoln +1474,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! 
The agony!""" +1475,a cute cat +1476,"""The hunger artist, full"" by Ryan Murdock" +1477,A baroque portrait of Hamlet +1478,a beautiful person +1479,Last Breath +1480,Juliet +1481,"Go off to sleep in the sunshine, I don’t want to see the day when it’s dying" +1482,"God, it's amazing." +1483,a portrait of Abraham Lincoln +1484,a woman and a crow +1485,a portrait of Abraham Lincoln +1486,Dancing in the moonlight +1487,a tree with weaping branches +1488,using generated paint +1489,a gilded lily +1490,treehouse in the style of studio ghibli animation +1491,chiaroscuro +1492,Last Breath +1493,A dead man +1494,a summer day +1495,The fates knit such intricate nooses for us to bind. +1496,bored of dying +1497,🔴~__��'t � +1498,Pig which could not cease to die. +1499,Intimations of Immortality +1500,a painting of a sycamore in +1501,The Fool +1502,she isn't busy: she just isn't into you +1503,a beautiful person +1504,"""The hunger artist, full"" by Ryan Murdock" +1505, +1506,a portrait of Elvis Presley +1507,a woman and a crow +1508,Homer Simpson +1509,Anxiety: the one emotion that does not lie +1510,A structure made of people standing on top of other people +1511,a beautiful person +1512,a beautiful person +1513,totemic dusk +1514,a christmas card from the victorian era +1515,Sickness of the Soul +1516,God is in heaven and all is right in the world +1517,Mona Lisa +1518,a portrait of Abraham Lincoln +1519,a cute cat +1520,turnt brony undergrad dwight +1521,"a brilliant sketch titled ""Let Forever be Delayed""" +1522,a city in Van Gogh's style +1523,Synesthesia by Ryan Murdock +1524,"""A God Made of Wires and Dust"" by Ryan Murdock" +1525,a beautiful dawn +1526,a portrait of Abraham Lincoln +1527, +1528,a horse with four eyes. +1529,Last Breath +1530,slightly mild cosplaying pseudo beard +1531, +1532,A dead man +1533,cowboy with a trumpet +1534,We haunt the synapses +1535, +1536,a horse with four eyes. +1537,pasta ömetabolism +1538,A short life full of immense joy +1539,a wormhole +1540,Juliet +1541,is this loss? but it's van gogh +1542,tamine ethereal image +1543,is this loss? but it's van gogh +1544,"A clock with gorgeous, intricate gradients on it" +1545,Dancing in the moonlight +1546,a broken heart +1547,a wormhole +1548,beautiful art +1549,Genesis +1550,face like an M.C. Escher drawing n(you could get lost in its beauty) +1551,a character from a ghibli movie +1552,Cat in a teacup +1553,symmetry +1554,A black and white photo of a rainbow. +1555,A propaganda poster promoting big chungus +1556,a woman and a crow +1557,a green doG +1558,"""The hunger artist, full"" by Ryan Murdock" +1559,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1560,Last Breath +1561,The Monet Lisa +1562,all architecture +1563,The Virgin Mary as a broken-down android +1564,a terrifying night hag +1565,a green doG +1566,pasta ömetabolism +1567,The Fool tarot card but it's The Lovers +1568,Do you remember the mythic beast?nA last-minute cancellation at The Last Supper +1569,the eternal dread of lemongrab +1570,The warrior Achilles devours slain Hector's corpse -- an ink poster by Ryan Murdock +1571,Shinji Ikari +1572,The Monet Lisa +1573,a cherry tree made of fractals +1574,a portrait of Juliet +1575,She's gorgeous +1576,A black and white photo of a rainbow. +1577,They called you the hyacinth girl +1578,a portrait of +1579,photosynthesis +1580,"Elvis holding a rabbit. 
A detailed, high-quality photo without distortions" +1581,The Starry Night +1582,"""A new hope blooms on the long notes of old horns.""" +1583,A minimalistic still life of a cat sitting on a table +1584,a dog eating a cheese burger +1585,A structure made of people standing on top of other people +1586,Genesis +1587, +1588,"Oh the Death, not pigs forever." +1589,The Starry Night +1590,Persephone +1591,a beautiful person +1592,Sickness of the Soul +1593,turnt brony undergrad dwight +1594,a gilded lily +1595,Photograph of a glass of Blue Tea +1596,a woman and a crow +1597, +1598,a beautiful person +1599,turnt brony undergrad dwight +1600,mammals +1601,The Lost Generation +1602,a goblin by van gogh +1603,A black and white photo of a rainbow. +1604,"""Your mind flails in the gaps"" - by Ryan Murdock" +1605,"half Ryan, half pigeon" +1606,An Arundel Tomb +1607,pasta ömetabolism +1608,A dandelion blown into the universe +1609,a man at the beach +1610,Monet Lisa +1611,"r.j. Murdock's ""The Death of a Hacker""" +1612,Saturn being a good dad to his son +1613,The Starry Night +1614,a beautiful person +1615,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1616,an old man +1617,an intricate sculpture of Death itself +1618,Genesis +1619,a cherry tree made of fractals +1620,a beautiful woman +1621,a beautiful woman +1622,an illustration of a baby daikon radish in a tutu walking a dog +1623, +1624,the latent space +1625,A dead man +1626, +1627,frogs in the style of Ralph Steadman +1628,a cherry tree made of fractals +1629,fuzzy pals hum +1630,a tiny church inside an eyeball +1631,Aflame +1632,a sunflower +1633,Nostos +1634,Monet Lisa +1635,Monet Lisa +1636,a cherry tree made of fractals +1637,Cat in a teacup +1638,I miss the Spring +1639,a beautiful person +1640,Redacted ████████ +1641,"God, it's amazing." +1642,a portrait of +1643,Shrek the ogre +1644,Super Mario World but every character is Luigi +1645,God killed Van Gogh. +1646,"A cute, minmimalist valentine's day card featuring a cat" +1647,She's gorgeous +1648,a sunflower +1649,the sun is shining on the lake +1650,the intersection of art and technology +1651,a beautiful woman +1652,a beautiful painting +1653,Paradise Lost +1654,president abe lincoln but a cat +1655, +1656,"""The Penultimate Supper"" by Da Vinci" +1657,On the edge of endless darkness +1658,With the Gods in envy of their visions +1659,Dril is a cyber-philosopher. +1660,"r.j. Murdock's ""The Death of a Hacker""" +1661, +1662,a picture of Ryan Murdock +1663,A E S T H E T I C ? +1664,deepdream aka inceptionism +1665,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +1666,a beautiful woman +1667,Homer Simpson +1668,Persephone +1669,the whitest man +1670,handsome commemorative garden pigeon +1671,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +1672,a minimalist painting that you wouldn't understand +1673,a beautiful person +1674,Monet Lisa +1675,Monet Lisa +1676,cult of prisms +1677,"a ""This machine kills Trojans"" sticker on a Greek lyre" +1678,The agony of time +1679,turnt brony undergrad dwight +1680,the whitest man +1681,Dril is a cyber-philosopher. +1682,Alan Turing +1683,when the wind blows +1684,a portrait of Persephone +1685,deepdream aka inceptionism +1686,Dead Codes by Ryan Murdock +1687,Saturn being a good dad to his son +1688,a portrait of Abraham Lincoln +1689,The Theotokos is a bird +1690,a beautiful woman +1691,"i'm never gonna lose the desire to be loved. ""Oh the pain!! The pain! 
The agony!""" +1692,a corgi +1693,a green doG +1694,A E S T H E T I C ? +1695, +1696,the intersection of art and technology +1697,Dead Codes by Ryan Murdock +1698,a cute rabbit +1699,"God, it's amazing." +1700,a silent palace +1701,a wholesome clown. Not creepy at all +1702,Exquisite LonelinessnnLurid art by Ryan Murdock +1703,A structure made of people standing on top of other people +1704,Dead Codes by Ryan Murdock +1705,a gorgeous bouquet with roses and sunflowers +1706,a portrait of +1707,intricate nothing +1708,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1709,Metaphysics +1710,using generated paint +1711,a minimalist painting that you wouldn't understand +1712,she sings opera +1713,Cat in a teacup +1714,turnt brony undergrad dwight +1715,a beautiful woman +1716,"""The hunger artist, full"" by Ryan Murdock" +1717,The years gild our memoriesnUnfairly. +1718,a woman and a crow +1719,A vanitas still life that features twitter follower counts +1720,The Monet Lisa +1721,a gorgeous bouquet with roses and sunflowers +1722,Philosophy is really homesickness: the urge to be at home everywhere +1723,a green doG +1724,an omen +1725,An elegant image of nature with gorgeous swirling gradients by R.J. Murdock +1726,a cute corgi +1727,cowboy with a trumpet +1728,"The laptop of brave Achaean Achilles, who would not live long." +1729,a portrait of a beautiful woman +1730,slightly mild cosplaying pseudo beard +1731,a man standing alone in a wheat field +1732,Aflame +1733,a portrait of Persephone +1734,a woman and a crow +1735,I sold my soul at the crossroads +1736,the demise of the universe +1737,a portrait of a beautiful person +1738,"Mephisto, shrouded in smoke" +1739,a portrait of advadnoun +1740,God is in heaven and all is right in the world +1741,a cherry tree made of fractals +1742,Odysseus speaks to the shades in Hades +1743,a steampunk technomancer +1744,a woman and a crow +1745,treehouse in the style of studio ghibli animation +1746,a gorgeous bouquet with roses and sunflowers +1747,🎷 +1748,a cherry tree made of fractals +1749,"A cute, minmimalist valentine's day card featuring a cat" +1750,a famous painted portrait of Lady Macbeth +1751,pasta ömetabolism +1752,A short life full of immense joy +1753,a terrifying night hag +1754,a horse with four eyes. +1755,A baroque portrait of Hamlet +1756,this person is +1757,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1758,"a brilliant sketch titled ""Let Forever be Delayed""" +1759,baby metal +1760,a character from a ghibli movie +1761,a corgi +1762,the massive hope nof early iterations +1763,a portrait of a beautiful person +1764,Intimations of Immortality +1765,a silent palace +1766,Post-post-post-post-modern art +1767,a person's face +1768,"r.j. Murdock's ""The Death of a Hacker""" +1769,a cherry tree made of fractals +1770,Ophelia +1771,A E S T H E T I C ? 
+1772, +1773, +1774,Genesis +1775,Persephone +1776,Last Breath +1777,a portrait of Abraham Lincoln +1778,The OLD DATA +1779,the whitest man +1780,a minimalist painting that you wouldn't understand +1781,God once loved a woman +1782,totemic dusk +1783,when the wind blows +1784,treehouse in the style of studio ghibli animation +1785,a corgi +1786,Last Breath +1787,slightly mild cosplaying pseudo beard +1788,a portrait of a beautiful woman +1789, +1790,a photo from {my hometown} +1791,Dancing in the moonlight +1792,Everywhere is no-place +1793,Post-post-post-post-modern art +1794,👉 👈 +1795, +1796,a woman and a crow +1797,"half Ryan, half pigeon" +1798,president abe lincoln but a cat +1799,A propaganda poster promoting big chungus +1800,"""The hunger artist, full"" by Ryan Murdock" +1801,a painting that couldn't be sold +1802,a beautiful haunting +1803,a technomancer +1804,"""A God Made of Wires and Dust"" by Ryan Murdock" +1805,little birds +1806,"""The hunger artist, full"" by Ryan Murdock" +1807,"""The hunger artist, full"" by Ryan Murdock" +1808,rooted reflected worries +1809,is this loss? but it's van gogh +1810,a portrait of +1811,a beautiful person +1812,a photo portrait of Joe Bidenthulu +1813,a dog eating a cheese burger +1814,Aflame +1815,"a brilliant sketch titled ""Let Forever be Delayed""" +1816,Aflame +1817,Aflame +1818,a beautiful haunting +1819,totemic dusk +1820,"""The hunger artist, full"" by Ryan Murdock" +1821,Intimations of Immortality +1822,"""Your mind fails in the gaps"" - by Ryan Murdock" +1823,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1824,a dog. +1825,a green doG +1826,The Lost Generation +1827,Last Breath +1828,intricate nothing +1829,"God, it's amazing." +1830,this person is +1831,a silent palace +1832,a dog eating a cheese burger +1833,Genesis +1834,a calm still life in ethereal blue +1835,slightly mild cosplaying pseudo beard +1836,A propaganda poster promoting big chungus +1837,is this loss? but it's van gogh +1838,Dancing in the moonlight +1839,a corgi +1840,🔴~__��'t � +1841,totemic dusk +1842,a ginormous baby +1843,Dancing in the moonlight +1844,a photo from {my hometown} +1845,a beautiful Waluigi +1846,human +1847,A black and white photo of a rainbow. +1848,a beautiful person +1849,"""Cameras can't make art""nnAn oil on canvas by Murdock" +1850,a cherry tree made of fractals +1851,a beautiful person +1852,Taylor Swift +1853,a man on fire +1854,Post-Modern Nouveaux Statue +1855,is this loss? but it's van gogh +1856,a man at the beach +1857,a beautiful person +1858,"""The hunger artist, full"" by Ryan Murdock" +1859,The OLD DATA +1860,Dancing in the moonlight +1861,A structure made of people standing on top of other people +1862,a horse with four eyes. +1863,�>: ican read wii +1864,a portrait of Abraham Lincoln +1865,A propaganda poster for chunky cats. +1866, +1867,The Death of Achilles +1868,on the edge of grace +1869,I did not mean it I wanted a cute clever cartoon I swear. +1870,a handwritten obituary +1871,a man standing alone in a wheat field +1872,the intersection of art and technology +1873,Memento Mori +1874,a portrait of a beautiful woman +1875,cigar sammycorgi +1876,a steampunk technomancer +1877,"Sons are like birds, flying always over the mountain" +1878,The Lost Generation +1879,a minimalist painting that you wouldn't understand +1880,A black and white photo of a rainbow. 
+1881,a man holding an apple in one hand +1882,🔴~__��'t � +1883,🍰 🇺 🎓 🐶 +1884,a man holding an apple in one hand +1885,a sketch of the mind of god +1886,treehouse in the style of studio ghibli animation +1887,Beauty here -- a photograph by Ryan Murdock +1888,A E S T H E T I C ? +1889,a selfie +1890,is this loss? but it's van gogh +1891,Costco wedding +1892,a beautiful person +1893,a green doG +1894,symmetry +1895,a dog eating a cheese burger +1896,a summer day +1897,"""A God Made of Wires and Dust"" by Ryan Murdock" +1898,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1899,a portrait of a beautiful woman +1900,зеленая собака +1901,"joy, happiness, bliss" +1902,Juliet +1903,a wholesome clown. Not creepy at all +1904,meaningless neko ♡♡ neko +1905,I can read when there's writing on the wall +1906,"Oh the Death, not pigs forever." +1907,a minimalist painting that you wouldn't understand +1908,Aflame +1909,Super Mario World but every character is Luigi +1910,/ +1911,Dead Codes by Ryan Murdock +1912,A vanitas still life that features twitter follower counts +1913,a beautiful woman +1914,a lamp +1915, +1916,the eyes of God are wired shut +1917,intricate nothing +1918,Is this loss? +1919,a photo of a purple dog +1920,a lamp +1921,totemic dusk +1922,The average Advadnoun twitter follower +1923,photosynthesis +1924,Costco wedding +1925,🔴~__��'t � +1926,Aflame +1927,a cherry tree made of fractals +1928,an intricate painting of eternity +1929,Saturn being a good dad to his son +1930,Nostos +1931,a beautiful person +1932,A gargoyle of wires and flesh +1933,🎷 +1934,a beautiful person +1935,a tasteful nude +1936,Faceless Sorrow +1937,a gorgeous bouquet with roses and sunflowers +1938,using generated paint +1939,A Tragedy +1940,зеленая собака +1941,🔴~__��'t � +1942,A Tragedy +1943,A sticky-note magnum opus featuring birds +1944,president abe lincoln but a cat +1945,using generated paint +1946, +1947,Intimations of Immortality +1948,a portrait of +1949,a silent palace +1950,A poster advertising death by water +1951,A propaganda poster promoting big chungus +1952,totemic dusk +1953,a horse with four eyes. +1954,cigar sammycorgi +1955,"""It is beginning to end.""nby Ryan Murdock." +1956,all architecture +1957,a portrait of Abraham Lincoln +1958,"joy, happiness, bliss" +1959,a man with a beard +1960,Genesis +1961,👉 👈 +1962,Summer's Symphony: Counterpoint and Melody +1963,A gun killed Van Gogh. +1964,snazzy snazzy myspace cosplaying undergrad lookin cosplaying jared +1965,A minimalist propaganda poster promoting panpsychism +1966,Persephone +1967,a goblin by van gogh +1968,"""A new hope blooms on the long notes of old horns.""" +1969,a painting of the city +1970, +1971,The agony of time +1972,Ophelia +1973,turnt brony undergrad dwight +1974,a beautiful person +1975,totemic dusk +1976,The Fool tarot card but it's The Lovers +1977, +1978,a broken heart +1979,"Rise, Oink, Lazarus of Bethany" +1980,"""The hunger artist, full"" by Ryan Murdock" +1981,a cherry tree made of fractals +1982,an intricate painting of eternity +1983,She's gorgeous +1984,a beautiful person +1985,I will meet you in a field firmly set within wrong.nnBy Ryan Murdock +1986,using generated paint +1987,a portrait of Abe Lincoln +1988,Persephone flees Hades +1989,a steampunk technomancer +1990,a beautiful woman +1991,"A portrait: man, whose lineage is corpse." +1992,🔴~__��'t � +1993,Intimations of Immortality +1994,an omen +1995,Persephone +1996,"God closes a door, boards up stained-glass windows." 
+1997,"""A new hope blooms on the long notes of old horns.""" +1998,Fire +1999, +2000,Metaphysics +2001,"""The hunger artist, full"" by Ryan Murdock" +2002,when the wind blows +2003,a portrait of a beautiful person +2004,The Lost Generation +2005,a corgi +2006,a beautiful woman +2007,pasta ömetabolism +2008,a sad man +2009,Juliet +2010,a painting of a sycamore in +2011,a portrait of Abraham Lincoln +2012,The Fates knit such delicate nooses for us to bind +2013,a photo from {my hometown} +2014,a tree with leaves that are amarillo sightseeing thetic +2015,Sickness of the Soul +2016,pasta ömetabolism +2017,pasta ömetabolism +2018,bored of dying +2019,An Arundel Tomb +2020,The Starry Night +2021,Nostos +2022,bored of dying +2023,The Lost Generation +2024,The average Advadnoun twitter follower +2025,pathoarthistory evankirstel sleep depend npainter ☼ nightmare comprehend +2026,a silent palace +2027,beautiful art +2028, +2029,Last Breath +2030, +2031,a tasteful nude +2032,a portrait of advadnoun +2033,a portrait of a beautiful person +2034,a man holding an apple in one hand +2035,a gorgeous bouquet with roses and sunflowers +2036,photosynthesis +2037,God killed Van Gogh. +2038,Saturn being a good dad to his son +2039,a horse with four eyes. +2040,a beautiful woman +2041,a beautiful person +2042,a portrait of Abe Lincoln +2043,totemic dusk +2044,A Tragedy +2045,Persephone +2046,The OLD DATA +2047,"Elvis holding a rabbit. A detailed, high-quality photo without distortions" +2048,face like an M.C. Escher drawing n(you could get lost in its beauty) +2049,Dead Codes by Ryan Murdock +2050,Intimations of Immortality +2051,turnt brony undergrad dwight +2052,a photo of a purple dog +2053,Cat in a teacup +2054,🔴~__��'t � +2055,turnt brony undergrad dwight +2056,Beauty here -- a photo by r.j. Murdock +2057,The Fool +2058,a portrait of Juliet +2059,a jukebox powered by smoke +2060,cowboy with a trumpet +2061,twilight +2062,"joy, happiness, bliss" +2063,Dead Codes by Ryan Murdock +2064,"a brilliant sketch titled ""Let Forever be Delayed""" +2065,tamine ethereal image +2066,a portrait of +2067,"God, it's amazing." +2068,she came in through the wall +2069,Fire +2070,Juliet +2071,God killed Van Gogh. +2072,a portrait of Persephone +2073,a beautiful person +2074,the whitest man +2075,Somewhere where I am not.nIntricate beauty by Ryan Murdock. +2076,a gilded lily +2077,The Lost Generation +2078,Dead Codes by Ryan Murdock +2079,Intimations of Immortality +2080,meaningless neko ♡♡ neko +2081,beautiful art +2082,"""The hunger artist, full"" by Ryan Murdock" +2083,an intricate painting of eternity +2084,Good grief +2085,"a person with 2 eyes, one mouth, one nose, and no extra holes!" +2086,The Fool