---
language: en
tags:
- summarization
- question-generation
license: apache-2.0
datasets:
- squad
---

# Introduction

[HuggingFace](https://huggingface.co/) is one of the most useful libraries for an NLP researcher or developer, providing numerous pre-trained models, datasets, and a wealth of utility functions. In this repository, I'm setting up a complete pipeline for a Machine Learning project, and the task I've chosen for the setup is Question Generation for Paragraphs. This is a seq2seq task, for which I intend to fine-tune a pre-trained encoder-decoder summarization Transformer such as BART / Pegasus. More specifically, I'm fine-tuning the `sshleifer/distilbart-cnn-6-6` model on the SQuAD dataset.

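
The base checkpoint and its (fast) tokenizer can be pulled straight from the Hub with the standard `transformers` API. A minimal sketch of the starting point for fine-tuning (not the exact code in `train.py`):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# distilled BART checkpoint used as the starting point; AutoTokenizer returns
# the fast (Rust-backed) tokenizer by default
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-6-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-6-6")

print(model.config.encoder_layers, model.config.decoder_layers)  # 6, 6
```
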
# Features / Goals

* Environment setup using YAML file
* Hyper-parameter management with configs [done]
* Efficient data loading using LMDB [done]
* Dataset Visualization / Stats [done]
* Results Visualization / Stats [done]
* LR Scheduler [done]
* Multiple Decoding Algorithm Options [done]
* Intermediate Checkpoints [done]
* Parallel Logging to file [done]
* Use Fast Tokenizers [done]
* Latency + Efficiency Benchmarking [done]
* Distributed Training and Inference
* ONNX Optimization [not implemented in HuggingFace]
* Model Quantization [done] (see the sketch after this list)
* Model Distillation
* Hosting using Streamlit / Gradio
* Deploying on HuggingFace Hub

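
For the quantization item above ("dynamic quantized pred" in the logs), PyTorch's dynamic quantization is the likely mechanism. The snippet below is a minimal sketch under that assumption, not the exact code in `predict.py`:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-6-6")
model.eval()

# quantize the Linear layers to int8 weights for faster / smaller CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
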
# Dataset

The goal of Question Generation is to generate a valid and fluent question for a given passage and target answer. Hence, the input to the model is a passage context together with an answer, and the output / target is the question for that answer (a sketch of this framing follows below). Question Generation can be used in many scenarios, such as automatic tutoring systems, improving the performance of Question Answering models, and enabling chatbots to lead a conversation. The final dataset is created by taking the union of the following Question Answering datasets; each must be reduced to three columns: context, question, answer.

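
A rough illustration of that seq2seq framing is below; the `answer: ... context: ...` layout is only an assumption for the sketch, and the actual formatting lives in `src/dataset.py`:

```python
# one hypothetical training example; the real dataset.py may use a different
# separator or field order
context = "Oxygen is a chemical element with symbol O and atomic number 8."
answer = "8"
question = "What is the atomic number of oxygen?"

# encoder input pairs the answer with its passage; decoder target is the question
source_text = f"answer: {answer}  context: {context}"
target_text = question
print(source_text, "->", target_text)
```
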
## [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable. We use the SQuAD 1.1 variant, which does not have unanswerable questions, so every question has a corresponding answer and vice versa.

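
The pipeline here works from the raw SQuAD JSON files (see the commands below), but for a quick look the same data is also available through the `datasets` library. A small sketch, separate from this repo's scripts:

```python
from datasets import load_dataset

# "squad" is v1.1 (answerable questions only); "squad_v2" adds unanswerable ones
squad = load_dataset("squad")

example = squad["train"][0]
print(example["context"][:80])
print(example["question"])
print(example["answers"]["text"][0])
```
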
### Preprocessing

The first step is to remove questions which don't have answers. After that, we split the train set into train and eval sets, and treat the original dev set as the test set (one way to do this split is sketched below).

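
The split itself is done manually (see the commands section). A minimal sketch of one way to carve an eval set out of the processed train TSV; the 90/10 ratio, seed, and paths are assumptions, not necessarily what produced the splits reported below:

```python
import pandas as pd

# processed train set produced by data-format/squad.py (path assumed)
df = pd.read_csv("../data/squad/processed/train.tsv", sep="\t")

# hold out roughly 10% of the rows for evaluation
eval_df = df.sample(frac=0.1, random_state=42)
train_df = df.drop(eval_df.index)

train_df.to_csv("../data/squad/processed/splits/train.tsv", sep="\t", index=False)
eval_df.to_csv("../data/squad/processed/splits/eval.tsv", sep="\t", index=False)
```
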
### Stats

**Original Dataset**

| Split | Num Docs | Num Contexts | Ques w/ Ans | Ques w/o Ans | Num Unique Ans |
| ----- | -------- | ------------ | ----------- | ------------ | -------------- |
| Train | 442      | 19035        | 86821       | 43498        | 86821          |
| Dev   | 35       | 1204         | 5928        | 5945         | 10279          |

**After Preprocessing**

| Split | Num Rows | Context (max / avg / min words) | Answer     | Question    |
| ----- | -------- | ------------------------------- | ---------- | ----------- |
| Train | 80995    | 653 / 120 / 20                  | 43 / 3 / 1 | 40 / 10 / 1 |
| Eval  | 5826     | 445 / 123 / 67                  | 28 / 3 / 1 | 29 / 10 / 3 |
| Test  | 10297    | 629 / 129 / 25                  | 29 / 4 / 1 | 31 / 10 / 3 |

Each Context / Answer / Question cell lists the max, average, and min number of words in that field.

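
These per-field numbers are produced by `data-utils/data_stats.py`; the snippet below is only a sketch of the kind of computation involved, not that script:

```python
import pandas as pd

df = pd.read_csv("../data/squad/processed/splits/train.tsv", sep="\t")

# max / average / min word counts per field, as reported in the table above
for field in ["context", "answer", "question"]:
    lengths = df[field].astype(str).str.split().str.len()
    print(field, int(lengths.max()), round(lengths.mean(), 1), int(lengths.min()))
```
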
## [Natural Questions](https://ai.google.com/research/NaturalQuestions) [Not Used]

The Natural Questions corpus is a question answering dataset from Google. Each example consists of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question, and one or more short spans from the annotated passage containing the actual answer. The long and short answer annotations can, however, be empty. If both are empty, there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer that is "yes" or "no", instead of a list of short spans.

## [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) [Not Used]

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmarks such as SQuAD, as the answers to a question may not be directly obtainable by span prediction and the context is very long. TriviaQA consists of both human-verified and machine-generated QA subsets.

# Directory Structure

```bash
|-- README.md
|-- data
| |-- squad
| | |-- raw
| | | |-- train-v2.0.json
| | | |-- dev-v2.0.json
| | |-- processed
| | | |-- [processed data files]
| | | |-- splits
| | | | |-- train.tsv
| | | | |-- eval.tsv
| | | | |-- test.tsv
| | | | |-- lmdb*
| | | | | |-- train.lmdb*
| | | | | |-- eval.lmdb*
| | | | | |-- test.lmdb*
|-- data-format
| |-- squad.py
|-- data-utils
| |-- proto
| | |-- data_item.proto
| | |-- data_item_pb2.py*
| |-- data_stats.py
| |-- create_lmdb.py
|-- src
| |-- util.py
| |-- plotting.py
| |-- dataset.py
| |-- train.py
| |-- evaluate.py
| |-- predict.py
| |-- generate.py
|-- config
| |-- train.config
| |-- eval.config
| |-- pred.config
|-- vis
| |-- tsv_viewer.py
|-- stats
| |-- [stats related files]
|-- benchmark
| |-- benchmark.py
| |-- wordlist.txt
| |-- results.txt
|-- logs
| |-- train
| | |-- run_4 [20 epochs, no scheduler]
| | |-- run_6 [4 epochs and scheduler]
| |-- pred
| | |-- run_2 [default decoding params]
| | |-- run_3 [adjusted decoding params]
| | |-- run_4 [dynamic quantized pred]

* : file created programmatically
```

# Commands to run

```bash
# preliminary stats on the squad dataset
python3 squad.py --input_dir ../data/squad/raw/ --task raw_stats

# prepare the squad question answering data for consumption by converting it to tsv
python3 squad.py --input_dir ../data/squad/raw/ --output_dir ../data/squad/processed/ --task json2tsv

# at this point, manually split the train set into train and eval in ./splits

# fetch initial stats on the dataset
python3 data_stats.py --input_path ../data/squad/processed/splits/ --output_path ../stats/squad/

# take a look at a few samples of the dataset
streamlit run tsv_viewer.py -- --input_path ../data/squad/processed/splits/eval.tsv

# convert the tsv data into an lmdb database for efficient loading
python3 -m grpc_tools.protoc -I./proto --python_out=./proto ./proto/data_item.proto
python3 create_lmdb.py --input_path ../data/squad/processed/splits/ --output_path ../data/squad/processed/splits/lmdb/

# training routine [adjust params in config]
python3 train.py --config_filename ../config/train.config

# evaluation routine [adjust params in config]
python3 evaluate.py --config_filename ../config/eval.config

# get predictions [adjust params in config]
python3 predict.py --config_filename ../config/pred.config

# interactive predictions [adjust params in config]
python3 generate.py --config_filename ../config/pred.config

# view the results using streamlit
streamlit run tsv_viewer.py -- --input_path ../logs/pred/run_6/eval.tsv
```

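
The decoding options exposed through `pred.config` ultimately map onto HuggingFace's `generate` API. Below is a minimal beam-search sketch with the base checkpoint; the input framing and all parameter values are illustrative rather than the settings used in the runs above:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-6-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-6-6")

# assumed answer + context input framing (see the Dataset section)
source = "answer: 8  context: Oxygen is a chemical element with symbol O and atomic number 8."
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)

# beam search; swap in do_sample / top_k / top_p for the sampling-based decoders
output_ids = model.generate(
    **inputs, num_beams=4, max_length=32, early_stopping=True
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```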