COVID-19 Related Question Answering (Closed-Book Question Answering)

In 2020, COVID-19, the disease caused by the coronavirus SARS-CoV-2, spread across the world. It touched the lives of many people and caused great hardship. Many questions about COVID-19 remain open, and reliable answers are often hard to find. The aim of this project is to fine-tune models for closed-book question answering. In closed-book QA, we feed the model a question without any context or access to external knowledge and train it to predict the answer. Since the model receives no context, it must rely primarily on the "knowledge" it acquired during pre-training [1] [2].
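
As a rough illustration of the closed-book setup described above, the sketch below turns a question-answer pair into a text-to-text training example that contains no context passage. The function names, the `question: ` prefix, and the `input_text`/`target_text` field names are assumptions made for illustration; they are not taken from this project's scripts.

```python
from typing import Dict, List


def to_closed_book_example(
    question: str, answer: str, prefix: str = "question: "
) -> Dict[str, str]:
    """Build one text-to-text pair with no context passage.

    The default prefix is an illustrative assumption, not a project setting.
    """
    # Collapse repeated whitespace so each field is a clean single line.
    clean_question = " ".join(question.split())
    clean_answer = " ".join(answer.split())
    # No "context" field is added: the model input is the question alone.
    return {"input_text": prefix + clean_question, "target_text": clean_answer}


def build_dataset(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert raw {"question", "answer"} records, skipping incomplete ones."""
    examples = []
    for pair in pairs:
        question = pair.get("question", "")
        answer = pair.get("answer", "")
        if question.strip() and answer.strip():
            examples.append(to_closed_book_example(question, answer))
    return examples


if __name__ == "__main__":
    raw = [{"question": "Which virus causes  COVID-19?", "answer": "SARS-CoV-2"}]
    for example in build_dataset(raw):
        print(example)
```

The point of the sketch is what is absent: unlike open-book QA, nothing besides the question is placed in the input, so the target answer has to come from the model's parameters.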

The main goals of this project are:

- Train a model for question answering about COVID-19.
- Release the top-performing models for further research and enhancement.
- Release all of the preprocessing and postprocessing scripts and findings for future research.

TO DO LIST:

- Team members met and discussed the following:
  - A data preparation script that mixes CORD-19 and PubMed is ready.
  - Agreed to finalize the training scripts by 9pm PDT 7/9/2021.
  - The tokenizer is now trained.
- Set up the pre-training script.
- Prepare the fine-tuning tasks, inspired by the T5 Trivia Colab.
  - Which datasets do we want to use?
    - Covid-QA (maybe as the test set?)
    - Trivia
    - CDC-QA (we can scrape it quickly using Beautiful Soup or a similar tool)
    - More medical datasets (see the Datasets section for inspiration)
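
The data preparation item above mentions mixing CORD-19 and PubMed. One possible way to do such mixing is sketched below. The `mix_corpora` helper, the `pubmed_fraction` parameter, and the policy of keeping every CORD-19 document while subsampling PubMed are all assumptions made for illustration; they do not describe the project's actual script.

```python
import random
from typing import List


def mix_corpora(
    cord19: List[str],
    pubmed: List[str],
    pubmed_fraction: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled mix of CORD-19 and PubMed documents.

    Every CORD-19 document is kept. PubMed is subsampled so that it makes up
    roughly `pubmed_fraction` of the result (hypothetical policy).
    """
    if not 0.0 <= pubmed_fraction < 1.0:
        raise ValueError("pubmed_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    # Solve n_pubmed / (n_cord19 + n_pubmed) = pubmed_fraction for n_pubmed.
    wanted = int(round(pubmed_fraction * len(cord19) / (1.0 - pubmed_fraction)))
    # Never request more PubMed documents than are available.
    n_pubmed = min(wanted, len(pubmed))
    mixed = list(cord19) + rng.sample(pubmed, n_pubmed)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    cord19_docs = [f"cord-{i}" for i in range(4)]
    pubmed_docs = [f"pubmed-{i}" for i in range(10)]
    print(mix_corpora(cord19_docs, pubmed_docs, pubmed_fraction=0.5))
```

Keeping all of the in-domain corpus and subsampling the larger general biomedical corpus is only one design choice; the right ratio would have to be determined experimentally.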

1. Model

We will use the T5 model.

2. Datasets

The following datasets will be used to fine-tune the model. Note that the last dataset is optional and that the model is evaluated only on Covid-QA.

For Intermediate Pre-Training:

CORD-19

For Fine-Tuning:

3. Training Scripts

We can make use of:

4. Additional Reading

How Much Knowledge Can You Pack Into the Parameters of a Language Model?