FZH1996 committed
Commit e12dbbd · 1 Parent(s): 929f598

Upload 7 files

examples/NLG/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,9 @@
+ # Microsoft Open Source Code of Conduct
+
+ This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+
+ Resources:
+
+ - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
+ - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
+ - Contact [[email protected]](mailto:[email protected]) with questions or concerns
examples/NLG/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) Microsoft Corporation.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
examples/NLG/README.md ADDED
@@ -0,0 +1,182 @@
+ # Adapting GPT-2 using LoRA
+
+ This folder contains the implementation of LoRA in GPT-2 using the Python package `lora` and steps to replicate the results in our recent paper:
+
+ **LoRA: Low-Rank Adaptation of Large Language Models** <br>
+ *Edward J. Hu\*, Yelong Shen\*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen* <br>
+ Paper: https://arxiv.org/abs/2106.09685 <br>
+
+ <p>
+ <img src="figures/LoRA_GPT2.PNG" width="800" >
+ </p>
+
+ This repo reproduces our experiments on GPT-2.
+
+ ## Repository Overview
+
+ Our implementation is based on the fine-tuning code for GPT-2 in [Hugging Face](https://huggingface.co/).
+ There are several directories in this repo:
+ * [src/](src) contains the source code used for data processing, training, and decoding.
+ * [eval/](eval) contains the code for task-specific evaluation scripts.
+ * [data/](data) contains the raw data we used in our experiments.
+ * [vocab/](vocab) contains the GPT-2 vocabulary files.
+
+ ## Getting Started
+
+ 1. You can start with the following Docker image: `nvcr.io/nvidia/pytorch:20.03-py3` on a GPU-capable machine, but any generic PyTorch image should work.
+ ```
+ docker run -it nvcr.io/nvidia/pytorch:20.03-py3
+ ```
+
+ 2. Clone the repo and install dependencies in a virtual environment (remove `sudo` if running in a Docker container):
+ ```
+ sudo apt-get update
+ sudo apt-get -y install git jq virtualenv
+ git clone https://github.com/microsoft/LoRA.git; cd LoRA
+ virtualenv -p `which python3` ./venv
+ . ./venv/bin/activate
+ pip install -r requirement.txt
+ bash download_pretrained_checkpoints.sh
+ bash create_datasets.sh
+ cd ./eval
+ bash download_evalscript.sh
+ cd ..
+ ```
+
+ #### Now we are ready to replicate the results in our paper.
+
+ ## Replicating Our Result on E2E
+
+ 1. Train GPT-2 Medium with LoRA (see our paper for the GPT-2 Medium hyperparameters):
+ ```
+ python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
+ --train_data ./data/e2e/train.jsonl \
+ --valid_data ./data/e2e/valid.jsonl \
+ --train_batch_size 8 \
+ --grad_acc 1 \
+ --valid_batch_size 4 \
+ --seq_len 512 \
+ --model_card gpt2.md \
+ --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
+ --platform local \
+ --clip 0.0 \
+ --lr 0.0002 \
+ --weight_decay 0.01 \
+ --correct_bias \
+ --adam_beta2 0.999 \
+ --scheduler linear \
+ --warmup_step 500 \
+ --max_epoch 5 \
+ --save_interval 1000 \
+ --lora_dim 4 \
+ --lora_alpha 32 \
+ --lora_dropout 0.1 \
+ --label_smooth 0.1 \
+ --work_dir ./trained_models/GPT2_M/e2e \
+ --random_seed 110
+ ```
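
The `--lora_dim 4` and `--lora_alpha 32` flags in the training command correspond to the rank r and the scaling constant α in the LoRA paper, where a frozen weight W is adapted as W + (α/r)·BA. The following is a minimal sketch of the arithmetic this implies. The function names are illustrative and not part of this repo, and the 1024×1024 shape assumes a square projection at GPT-2 Medium's hidden size.

```python
def lora_scaling(lora_alpha: float, lora_dim: int) -> float:
    """Scaling applied to the low-rank update BA: alpha / r."""
    return lora_alpha / lora_dim


def lora_param_count(d_out: int, d_in: int, lora_dim: int) -> int:
    """Trainable parameters added to one d_out x d_in weight.

    B has shape (d_out, r) and A has shape (r, d_in).
    """
    return lora_dim * (d_out + d_in)


if __name__ == "__main__":
    r, alpha = 4, 32  # values taken from the training command
    d = 1024          # assumption: GPT-2 Medium hidden size, square projection
    added = lora_param_count(d, d, r)
    print("scaling:", lora_scaling(alpha, r))
    print("added params per adapted matrix:", added)
    print("fraction of the frozen matrix:", added / (d * d))
```

Under these assumptions each adapted matrix gains 8,192 trainable parameters, under 1% of the 1,048,576 frozen ones, and the update is scaled by 8.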
+
+ 2. Generate outputs from the trained model using beam search:
+ ```
+ python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
+ --data ./data/e2e/test.jsonl \
+ --batch_size 1 \
+ --seq_len 512 \
+ --eval_len 64 \
+ --model_card gpt2.md \
+ --init_checkpoint ./trained_models/GPT2_M/e2e/model.26289.pt \
+ --platform local \
+ --lora_dim 4 \
+ --lora_alpha 32 \
+ --beam 10 \
+ --length_penalty 0.8 \
+ --no_repeat_ngram_size 4 \
+ --repetition_penalty 1.0 \
+ --eos_token_id 628 \
+ --work_dir ./trained_models/GPT2_M/e2e \
+ --output_file predict.26289.b10p08r4.jsonl
+ ```
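
`--no_repeat_ngram_size 4` asks the decoder not to produce any 4-gram twice within one output. A common way to enforce this, sketched below, is to ban every token that would complete an n-gram already present in the hypothesis. This illustrates the general technique and is not taken from `src/gpt2_beam.py`; `banned_next_tokens` is a hypothetical helper.

```python
from typing import List, Set


def banned_next_tokens(tokens: List[int], n: int) -> Set[int]:
    """Return the tokens that would repeat an n-gram already in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the next candidate n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i:i + n]
        # If an earlier n-gram starts with the same prefix, its last token
        # would recreate that n-gram, so it must be excluded.
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


if __name__ == "__main__":
    history = [1, 2, 3, 1, 2]
    print(banned_next_tokens(history, 3))
    print(banned_next_tokens(history, 4))
```

In a beam search loop, the scores of the banned tokens would typically be set to negative infinity before the next token is selected.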
+
+ 3. Decode the outputs from step 2:
+ ```
+ python src/gpt2_decode.py \
+ --vocab ./vocab \
+ --sample_file ./trained_models/GPT2_M/e2e/predict.26289.b10p08r4.jsonl \
+ --input_file ./data/e2e/test_formatted.jsonl \
+ --output_ref_file e2e_ref.txt \
+ --output_pred_file e2e_pred.txt
+ ```
+
+ 4. Run evaluation on the E2E test set:
+
+ ```
+ python eval/e2e/measure_scores.py e2e_ref.txt e2e_pred.txt -p
+ ```
+
+ ## Replicating Our Result on WebNLG
+
+ 1. Follow steps 1 and 2 of the E2E pipeline, replacing references to E2E with webnlg (see our paper for hyperparameters).
+
+ 2. Decode the outputs from beam search (step 2 above):
+ ```
+ python src/gpt2_decode.py \
+ --vocab ./vocab \
+ --sample_file ./trained_models/GPT2_M/webnlg/predict.20000.b10p08.jsonl \
+ --input_file ./data/webnlg_challenge_2017/test_formatted.jsonl \
+ --ref_type webnlg \
+ --ref_num 6 \
+ --output_ref_file eval/GenerationEval/data/references_webnlg \
+ --output_pred_file eval/GenerationEval/data/hypothesis_webnlg \
+ --tokenize --lower
+ ```
+
+ 3. Run evaluation on the WebNLG test set:
+ ```
+ cd ./eval/GenerationEval/
+ python eval.py \
+ -R data/references_webnlg/reference \
+ -H data/hypothesis_webnlg \
+ -nr 6 \
+ -m bleu,meteor,ter
+ cd ../..
+ ```
+
+ ## Replicating Our Result on DART
+
+ 1. Follow steps 1 and 2 of the E2E pipeline, replacing references to E2E with dart (see our paper for hyperparameters).
+
+ 2. Decode the outputs from beam search (step 2 above):
+ ```
+ python src/gpt2_decode.py \
+ --vocab ./vocab \
+ --sample_file ./trained_models/GPT2_M/dart/predict.20000.b10p08.jsonl \
+ --input_file ./data/dart/test_formatted.jsonl \
+ --ref_type dart \
+ --ref_num 6 \
+ --output_ref_file eval/GenerationEval/data/references_dart \
+ --output_pred_file eval/GenerationEval/data/hypothesis_dart \
+ --tokenize --lower
+ ```
+
+ 3. Run evaluation on the DART test set:
+ ```
+ cd ./eval/GenerationEval/
+ python eval.py \
+ -R data/references_dart/reference \
+ -H data/hypothesis_dart \
+ -nr 6 \
+ -m bleu,meteor,ter
+ cd ../..
+ ```
+
+ ## Citation
+ ```
+ @misc{hu2021lora,
+ title={LoRA: Low-Rank Adaptation of Large Language Models},
+ author={Hu, Edward and Shen, Yelong and Wallis, Phil and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+ year={2021},
+ eprint={2106.09685},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
examples/NLG/SECURITY.md ADDED
@@ -0,0 +1,41 @@
+ <!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->
+
+ ## Security
+
+ Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
+
+ If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
+
+ ## Reporting Security Issues
+
+ **Please do not report security vulnerabilities through public GitHub issues.**
+
+ Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
+
+ If you prefer to submit without logging in, send email to [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
+
+ You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
+
+ Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
+
+ * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
+ * Full paths of source file(s) related to the manifestation of the issue
+ * The location of the affected source code (tag/branch/commit or direct URL)
+ * Any special configuration required to reproduce the issue
+ * Step-by-step instructions to reproduce the issue
+ * Proof-of-concept or exploit code (if possible)
+ * Impact of the issue, including how an attacker might exploit the issue
+
+ This information will help us triage your report more quickly.
+
+ If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
+
+ ## Preferred Languages
+
+ We prefer all communications to be in English.
+
+ ## Policy
+
+ Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
+
+ <!-- END MICROSOFT SECURITY.MD BLOCK -->
examples/NLG/create_datasets.sh ADDED
@@ -0,0 +1,44 @@
+ #!/bin/bash
+
+ echo "creating e2e datasets..."
+ path=data/e2e
+ echo "train..."
+ python src/format_converting_e2e.py $path/train.txt $path/train_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
+ echo "test..."
+ python src/format_converting_e2e.py $path/test.txt $path/test_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
+
+ echo "valid..."
+ python src/format_converting_e2e.py $path/valid.txt $path/valid_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
+
+ echo "creating webnlg datasets..."
+ path=data/webnlg_challenge_2017
+ echo "train..."
+ python src/format_converting_webnlg.py $path/train.json $path/train_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
+
+ echo "test..."
+ python src/format_converting_webnlg.py $path/test.json $path/test_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
+
+ echo "valid..."
+ python src/format_converting_webnlg.py $path/dev.json $path/valid_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
+
+ echo "creating dart datasets..."
+ path=data/dart
+ echo "train..."
+ python src/format_converting_dart.py $path/dart-v1.1.1-full-train.json $path/train_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos
+
+ echo "test..."
+ python src/format_converting_dart.py $path/dart-v1.1.1-full-test.json $path/test_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos
+
+ echo "valid..."
+ python src/format_converting_dart.py $path/dart-v1.1.1-full-dev.json $path/valid_formatted.jsonl
+ python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos
+
+ echo "script complete!"
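
Each dataset in `create_datasets.sh` goes through the same two steps per split: a format converter writes `<split>_formatted.jsonl`, then `src/gpt2_encode.py` tokenizes it into `<split>.jsonl`. The sketch below builds those command lines in Python to make the pattern explicit. `build_commands` is a hypothetical helper, not a script in this repo, and it only constructs the argument lists without running anything.

```python
from typing import List


def build_commands(path: str, converter: str, raw_file: str, split: str) -> List[List[str]]:
    """Build the convert and encode commands for one dataset split."""
    formatted = f"{path}/{split}_formatted.jsonl"
    encoded = f"{path}/{split}.jsonl"
    convert = ["python", converter, f"{path}/{raw_file}", formatted]
    encode = [
        "python", "src/gpt2_encode.py",
        "--vocab", "vocab",
        "--input", formatted,
        "--output", encoded,
        "--add_bos", "--add_eos",
    ]
    return [convert, encode]


if __name__ == "__main__":
    # WebNLG names its raw validation file dev.json but writes valid.jsonl.
    for cmd in build_commands(
        "data/webnlg_challenge_2017",
        "src/format_converting_webnlg.py",
        "dev.json",
        "valid",
    ):
        print(" ".join(cmd))
```

Passing the raw file name separately from the split name covers the cases where they differ, such as `dev.json` for the WebNLG validation split and the `dart-v1.1.1-full-*.json` files for DART.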
examples/NLG/download_pretrained_checkpoints.sh ADDED
@@ -0,0 +1,11 @@
+ #!/bin/bash
+
+ echo "downloading pretrained model checkpoints..."
+ mkdir pretrained_checkpoints
+ cd pretrained_checkpoints
+ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
+ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin
+ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin
+ cd ..
+
+ echo "script complete!"
examples/NLG/requirement.txt ADDED
@@ -0,0 +1,7 @@
+ --find-links https://download.pytorch.org/whl/torch_stable.html
+ torch==1.7.1+cu101
+ transformers==3.3.1
+ spacy
+ tqdm
+ tensorboard
+ progress