Upload 7 files
- examples/NLG/CODE_OF_CONDUCT.md +9 -0
- examples/NLG/LICENSE +21 -0
- examples/NLG/README.md +182 -0
- examples/NLG/SECURITY.md +41 -0
- examples/NLG/create_datasets.sh +44 -0
- examples/NLG/download_pretrained_checkpoints.sh +11 -0
- examples/NLG/requirement.txt +7 -0
examples/NLG/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [[email protected]](mailto:[email protected]) with questions or concerns
examples/NLG/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
examples/NLG/README.md
ADDED
@@ -0,0 +1,182 @@
# Adapting GPT-2 using LoRA

This folder contains the implementation of LoRA in GPT-2 using the Python package `lora` and the steps to replicate the results in our recent paper:

**LoRA: Low-Rank Adaptation of Large Language Models** <br>
*Edward J. Hu\*, Yelong Shen\*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen* <br>
Paper: https://arxiv.org/abs/2106.09685 <br>

<p>
<img src="figures/LoRA_GPT2.PNG" width="800">
</p>

This repo reproduces our experiments on GPT-2.

## Repository Overview

Our implementation is based on the fine-tuning code for GPT-2 in [Hugging Face](https://huggingface.co/).
There are several directories in this repo:
* [src/](src) contains the source code used for data processing, training, and decoding.
* [eval/](eval) contains the code for task-specific evaluation scripts.
* [data/](data) contains the raw data we used in our experiments.
* [vocab/](vocab) contains the GPT-2 vocabulary files.
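
For orientation, the following is a minimal sketch of how a LoRA adapter is typically wired in with the `loralib` package; the layer sizes are illustrative and this is not the exact wiring used in [src/](src):

```
import torch
import loralib as lora

# A LoRA-augmented projection: the frozen weight W is summed with the
# trainable rank-r update B @ A, scaled by lora_alpha / r.
layer = lora.Linear(1024, 1024, r=4, lora_alpha=32, lora_dropout=0.1)
model = torch.nn.Sequential(layer)

# Freeze everything except the LoRA parameters A and B.
lora.mark_only_lora_as_trainable(model)

# Only the small LoRA weights need to be checkpointed.
torch.save(lora.lora_state_dict(model), "lora_only.pt")
```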

## Getting Started

1. You can start with the Docker image `nvcr.io/nvidia/pytorch:20.03-py3` on a GPU-capable machine; any generic PyTorch image should also work:
```
docker run -it nvcr.io/nvidia/pytorch:20.03-py3
```

2. Clone the repo and install the dependencies in a virtual environment (remove `sudo` if running in a Docker container):
```
sudo apt-get update
sudo apt-get -y install git jq virtualenv
git clone https://github.com/microsoft/LoRA.git; cd LoRA
virtualenv -p `which python3` ./venv
. ./venv/bin/activate
pip install -r requirement.txt
bash download_pretrained_checkpoints.sh
bash create_datasets.sh
cd ./eval
bash download_evalscript.sh
cd ..
```
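
Optionally (a sanity check we suggest here, not part of the original setup), confirm that the CUDA build of PyTorch pinned in `requirement.txt` is visible from the environment:
```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```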

#### Now we are ready to replicate the results in our paper.

## Replicating Our Result on E2E

1. Train GPT-2 Medium with LoRA (see our paper for the hyperparameters used for GPT-2 Medium):
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --valid_data ./data/e2e/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --random_seed 110
```

2. Generate outputs from the trained model using beam search:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
    --data ./data/e2e/test.jsonl \
    --batch_size 1 \
    --seq_len 512 \
    --eval_len 64 \
    --model_card gpt2.md \
    --init_checkpoint ./trained_models/GPT2_M/e2e/model.26289.pt \
    --platform local \
    --lora_dim 4 \
    --lora_alpha 32 \
    --beam 10 \
    --length_penalty 0.8 \
    --no_repeat_ngram_size 4 \
    --repetition_penalty 1.0 \
    --eos_token_id 628 \
    --work_dir ./trained_models/GPT2_M/e2e \
    --output_file predict.26289.b10p08r4.jsonl
```

3. Decode the outputs from step (2):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/e2e/predict.26289.b10p08r4.jsonl \
    --input_file ./data/e2e/test_formatted.jsonl \
    --output_ref_file e2e_ref.txt \
    --output_pred_file e2e_pred.txt
```

4. Run the evaluation on the E2E test set:
```
python eval/e2e/measure_scores.py e2e_ref.txt e2e_pred.txt -p
```

The script reports the official E2E metrics: BLEU, NIST, METEOR, ROUGE-L, and CIDEr.

## Replicating Our Result on WebNLG

1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with WebNLG (see our paper for the hyperparameters); a sketch of the adapted training command follows.
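
For concreteness, here is a sketch of the adapted training command. Only the data paths and `--work_dir` are substitutions we can read off this README; the remaining hyperparameters are copied from the E2E command above as placeholders, not the tuned WebNLG values from the paper:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
    --train_data ./data/webnlg_challenge_2017/train.jsonl \
    --valid_data ./data/webnlg_challenge_2017/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.md \
    --init_checkpoint ./pretrained_checkpoints/gpt2-medium-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 0.0002 \
    --weight_decay 0.01 \
    --correct_bias \
    --adam_beta2 0.999 \
    --scheduler linear \
    --warmup_step 500 \
    --max_epoch 5 \
    --save_interval 1000 \
    --lora_dim 4 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --label_smooth 0.1 \
    --work_dir ./trained_models/GPT2_M/webnlg \
    --random_seed 110
```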

2. Decode the outputs from beam search (step 2 above):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/webnlg/predict.20000.b10p08.jsonl \
    --input_file ./data/webnlg_challenge_2017/test_formatted.jsonl \
    --ref_type webnlg \
    --ref_num 6 \
    --output_ref_file eval/GenerationEval/data/references_webnlg \
    --output_pred_file eval/GenerationEval/data/hypothesis_webnlg \
    --tokenize --lower
```

3. Run the evaluation on the WebNLG test set:
```
cd ./eval/GenerationEval/
python eval.py \
    -R data/references_webnlg/reference \
    -H data/hypothesis_webnlg \
    -nr 6 \
    -m bleu,meteor,ter
cd ../..
```

## Replicating Our Result on DART

1. Follow steps 1 and 2 from the E2E pipeline, replacing references to E2E with DART (see our paper for the hyperparameters); a sketch of the adapted beam-search step follows.
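
Analogously, here is a sketch of the adapted beam-search step (step 2); the checkpoint name `model.20000.pt` is inferred from the prediction file consumed by the decoding step below and may differ for your run:
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
    --data ./data/dart/test.jsonl \
    --batch_size 1 \
    --seq_len 512 \
    --eval_len 64 \
    --model_card gpt2.md \
    --init_checkpoint ./trained_models/GPT2_M/dart/model.20000.pt \
    --platform local \
    --lora_dim 4 \
    --lora_alpha 32 \
    --beam 10 \
    --length_penalty 0.8 \
    --no_repeat_ngram_size 4 \
    --repetition_penalty 1.0 \
    --eos_token_id 628 \
    --work_dir ./trained_models/GPT2_M/dart \
    --output_file predict.20000.b10p08.jsonl
```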

2. Decode the outputs from beam search (step 2 above):
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_M/dart/predict.20000.b10p08.jsonl \
    --input_file ./data/dart/test_formatted.jsonl \
    --ref_type dart \
    --ref_num 6 \
    --output_ref_file eval/GenerationEval/data/references_dart \
    --output_pred_file eval/GenerationEval/data/hypothesis_dart \
    --tokenize --lower
```

3. Run the evaluation on the DART test set:
```
cd ./eval/GenerationEval/
python eval.py \
    -R data/references_dart/reference \
    -H data/hypothesis_dart \
    -nr 6 \
    -m bleu,meteor,ter
cd ../..
```
## Citation
```
@misc{hu2021lora,
    title={LoRA: Low-Rank Adaptation of Large Language Models},
    author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
    year={2021},
    eprint={2106.09685},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
examples/NLG/SECURITY.md
ADDED
@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, including [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).

If you prefer to submit without logging in, send email to [[email protected]](mailto:[email protected]). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->
examples/NLG/create_datasets.sh
ADDED
@@ -0,0 +1,44 @@
#!/bin/bash
# Convert the raw E2E, WebNLG, and DART data to jsonl and encode it with the GPT-2 BPE vocabulary.

echo "creating e2e datasets..."
path=data/e2e
echo "train..."
python src/format_converting_e2e.py $path/train.txt $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_e2e.py $path/test.txt $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_e2e.py $path/valid.txt $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "creating webnlg datasets..."
path=data/webnlg_challenge_2017
echo "train..."
python src/format_converting_webnlg.py $path/train.json $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_webnlg.py $path/test.json $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_webnlg.py $path/dev.json $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "creating dart datasets..."
path=data/dart
echo "train..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-train.json $path/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/train_formatted.jsonl --output $path/train.jsonl --add_bos --add_eos

echo "test..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-test.json $path/test_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/test_formatted.jsonl --output $path/test.jsonl --add_bos --add_eos

echo "valid..."
python src/format_converting_dart.py $path/dart-v1.1.1-full-dev.json $path/valid_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input $path/valid_formatted.jsonl --output $path/valid.jsonl --add_bos --add_eos

echo "script complete!"
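
For reference, the first conversion stage above does roughly the following. The `||` separator and the `context`/`completion` field names are assumptions about the raw E2E data and the jsonl layout; `src/format_converting_e2e.py` is the authoritative version:

```
import json
import sys

# Hypothetical sketch of src/format_converting_e2e.py: turn each raw
# "<meaning representation> || <reference>" line into one jsonl record.
with open(sys.argv[1], encoding="utf8") as reader, \
     open(sys.argv[2], "w", encoding="utf8") as writer:
    for line in reader:
        if not line.strip():
            continue
        context, completion = line.strip().split("||", 1)
        record = {"context": context.strip(), "completion": completion.strip()}
        writer.write(json.dumps(record) + "\n")
```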
examples/NLG/download_pretrained_checkpoints.sh
ADDED
@@ -0,0 +1,11 @@
#!/bin/bash
# Download the GPT-2 small/medium/large PyTorch checkpoints used as initializations.

echo "downloading pretrained model checkpoints..."
mkdir -p pretrained_checkpoints
cd pretrained_checkpoints
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin
cd ..

echo "script complete!"
examples/NLG/requirement.txt
ADDED
@@ -0,0 +1,7 @@
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.7.1+cu101
transformers==3.3.1
spacy
tqdm
tensorboard
progress