Upload 7 files
- examples/NLG/CODE_OF_CONDUCT.md +9 -0
- examples/NLG/LICENSE +21 -0
- examples/NLG/README.md +182 -0
- examples/NLG/SECURITY.md +41 -0
- examples/NLG/create_datasets.sh +44 -0
- examples/NLG/download_pretrained_checkpoints.sh +11 -0
- examples/NLG/requirement.txt +7 -0
examples/NLG/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [[email protected]](mailto:[email protected]) with questions or concerns
examples/NLG/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
examples/NLG/README.md
ADDED
@@ -0,0 +1,182 @@
# Adapting GPT-2 using LoRA

This folder contains the implementation of LoRA in GPT-2 using the Python package `lora` and the steps to replicate the results in our recent paper:

**LoRA: Low-Rank Adaptation of Large Language Models** <br>
*Edward J. Hu\*, Yelong Shen\*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen* <br>
Paper: https://arxiv.org/abs/2106.09685 <br>

<p>
<img src="figures/LoRA_GPT2.PNG" width="800">
</p>

This repo reproduces our experiments on GPT-2.

## Repository Overview

Our implementation is based on the fine-tuning code for GPT-2 in [Hugging Face](https://huggingface.co/).
There are several directories in this repo:
* [src/](src) contains the source code used for data processing, training, and decoding.
* [eval/](eval) contains the code for task-specific evaluation scripts.
* [data/](data) contains the raw data we used in our experiments.
* [vocab/](vocab) contains the GPT-2 vocabulary files.
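
For orientation, the following is a minimal sketch of how a LoRA adapter is typically wired in with the `loralib` package; the layer sizes are illustrative and this is not the exact wiring used in [src/](src):

```
import torch
import loralib as lora

# A LoRA-augmented projection: the frozen weight W is summed with the
# trainable rank-r update B @ A, scaled by lora_alpha / r.
layer = lora.Linear(1024, 1024, r=4, lora_alpha=32, lora_dropout=0.1)
model = torch.nn.Sequential(layer)

# Freeze everything except the LoRA parameters A and B.
lora.mark_only_lora_as_trainable(model)

# Only the small LoRA weights need to be checkpointed.
torch.save(lora.lora_state_dict(model), "lora_only.pt")
```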

## Getting Started

1. You can start with the Docker image `nvcr.io/nvidia/pytorch:20.03-py3` on a GPU-capable machine; any generic PyTorch image should also work:
```
docker run -it nvcr.io/nvidia/pytorch:20.03-py3
```

2. Clone the repo and install the dependencies in a virtual environment (remove `sudo` if running in a Docker container):
```
sudo apt-get update
sudo apt-get -y install git jq virtualenv
git clone https://github.com/microsoft/LoRA.git; cd LoRA
virtualenv -p `which python3` ./venv
. ./venv/bin/activate
pip install -r requirement.txt
bash download_pretrained_checkpoints.sh
bash create_datasets.sh
cd ./eval
bash download_evalscript.sh
cd ..
```
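
Optionally (a sanity check we suggest here, not part of the original setup), confirm that the CUDA build of PyTorch pinned in `requirement.txt` is visible from the environment:
```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```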

#### Now we are ready to replicate the results in our paper.

## Replicating Our Result on E2E

1. Train GPT-2 Medium with LoRA (see our paper for the hyperparameters used for GPT-2 Medium):
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --valid_data ./data/e2e/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --random_seed 110
```

2. Generate outputs from the trained model using beam search:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
    --data ./data/e2e/test.jsonl \
    --batch_size 1 \
    --seq_len 512 \
    --eval_len 64 \
    --model_card gpt2.md \
    --init_checkpoint ./trained_models/GPT2_M/e2e/model.26289.pt \
    --platform local \
    --lora_dim 4 \
    --lora_alpha 32 \
    --beam 10 \
    --length_penalty 0.8 \
    --no_repeat_ngram_size 4 \
    --repetition_penalty 1.0 \
    --eos_token_id 628 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --output_file predict.26289.b10p08r4.jsonl
```

3. Decode the outputs from step (2):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/e2e/predict.26289.b10p08r4.jsonl \
    --input_file ./data/e2e/test_formatted.jsonl \
    --output_ref_file e2e_ref.txt \
    --output_pred_file e2e_pred.txt
```

4. Run the evaluation on the E2E test set:
```
python eval/e2e/measure_scores.py e2e_ref.txt e2e_pred.txt -p
```

The script reports the official E2E metrics: BLEU, NIST, METEOR, ROUGE-L, and CIDEr.

## Replicating Our Result on WebNLG

1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with WebNLG (see our paper for the hyperparameters); a sketch of the adapted training command follows.
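
For concreteness, here is a sketch of the adapted training command. Only the data paths and `--work_dir` are substitutions we can read off this README; the remaining hyperparameters are copied from the E2E command above as placeholders, not the tuned WebNLG values from the paper:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
    --train_data ./data/webnlg_challenge_2017/train.jsonl \
    --valid_data ./data/webnlg_challenge_2017/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/webnlg \
    --random_seed 110
```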

2. Decode the outputs from beam search (step 2 above):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/webnlg/predict.20000.b10p08.jsonl \
    --input_file ./data/webnlg_challenge_2017/test_formatted.jsonl \
    --ref_type webnlg \
    --ref_num 6 \
    --output_ref_file eval/GenerationEval/data/references_webnlg \
    --output_pred_file eval/GenerationEval/data/hypothesis_webnlg \
    --tokenize --lower
```

3. Run the evaluation on the WebNLG test set:
```
cd ./eval/GenerationEval/
python eval.py \
    -R data/references_webnlg/reference \
    -H data/hypothesis_webnlg \
    -nr 6 \
    -m bleu,meteor,ter
cd ../..
```

## Replicating Our Result on DART

1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with DART (see our paper for the hyperparameters); a sketch of the adapted beam-search step follows.
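
Analogously, here is a sketch of the adapted beam-search step (step 2); the checkpoint name `model.20000.pt` is inferred from the prediction file consumed by the decoding step below and may differ for your run:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
    --data ./data/dart/test.jsonl \
    --batch_size 1 \
    --seq_len 512 \
    --eval_len 64 \
    --model_card gpt2.md \
    --init_checkpoint ./trained_models/GPT2_M/dart/model.20000.pt \
    --platform local \
    --lora_dim 4 \
    --lora_alpha 32 \
    --beam 10 \
    --length_penalty 0.8 \
    --no_repeat_ngram_size 4 \
    --repetition_penalty 1.0 \
    --eos_token_id 628 \
    --work_dir ./trained_models/GPT2_M/dart \
    --output_file predict.20000.b10p08.jsonl
```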

2. Decode the outputs from beam search (step 2 above):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/dart/predict.20000.b10p08.jsonl \
    --input_file ./data/dart/test_formatted.jsonl \
    --ref_type dart \
    --ref_num 6 \
    --output_ref_file eval/GenerationEval/data/references_dart \
    --output_pred_file eval/GenerationEval/data/hypothesis_dart \
    --tokenize --lower
```

3. Run the evaluation on the DART test set:
```
cd ./eval/GenerationEval/
python eval.py \
    -R data/references_dart/reference \
    -H data/hypothesis_dart \
    -nr 6 \
    -m bleu,meteor,ter
cd ../..
```
## Citation
```
@misc{hu2021lora,
    title={LoRA: Low-Rank Adaptation of Large Language Models},
    author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
    year={2021},
    eprint={2106.09685},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
examples/NLG/SECURITY.md
ADDED
@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, including [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).

If you prefer to submit without logging in, send email to [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->
examples/NLG/create_datasets.sh
ADDED
@@ -0,0 +1,44 @@
#!/bin/bash
# Convert the raw E2E, WebNLG, and DART data to jsonl and encode it with the GPT-2 BPE vocabulary.

echo "creating e2e datasets..."
path=data/e2e
echo "train..."
python src/format_converting_e2e.py $path/train.txt $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_e2e.py $path/test.txt $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_e2e.py $path/valid.txt $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "creating webnlg datasets..."
path=data/webnlg_challenge_2017
echo "train..."
python src/format_converting_webnlg.py $path/train.json $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_webnlg.py $path/test.json $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_webnlg.py $path/dev.json $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "creating dart datasets..."
path=data/dart
echo "train..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-train.json $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-test.json $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-dev.json $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "script complete!"
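
For reference, the first conversion stage above does roughly the following. The `||` separator and the `context`/`completion` field names are assumptions about the raw E2E data and the jsonl layout; `src/format_converting_e2e.py` is the authoritative version:

```
import json
import sys

# Hypothetical sketch of src/format_converting_e2e.py: turn each raw
# "<meaning representation> || <reference>" line into one jsonl record.
with open(sys.argv[1], encoding="utf8") as reader, \
     open(sys.argv[2], "w", encoding="utf8") as writer:
    for line in reader:
        if not line.strip():
            continue
        context, completion = line.strip().split("||", 1)
        record = {"context": context.strip(), "completion": completion.strip()}
        writer.write(json.dumps(record) + "\n")
```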
examples/NLG/download_pretrained_checkpoints.sh
ADDED
@@ -0,0 +1,11 @@
#!/bin/bash
# Download the GPT-2 small/medium/large PyTorch checkpoints used as initializations.

echo "downloading pretrained model checkpoints..."
mkdir -p pretrained_checkpoints
cd pretrained_checkpoints
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin
cd ..

echo "script complete!"
examples/NLG/requirement.txt
ADDED
@@ -0,0 +1,7 @@
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.7.1+cu101
transformers==3.3.1
spacy
tqdm
tensorboard
progress