{ "cells": [ { "cell_type": "markdown", "id": "12d87b30", "metadata": {}, "source": [ "# Load Data\n", "This notebook loads and preproceses all necessary data, namely the following.\n", "* OpenWebTextCorpus: for base DistilBERT model\n", "* SQuAD datasrt: for Q&A\n", "* Natural Questions (needs to be downloaded externally but is preprocessed here): for Q&A\n", "* HotPotQA: for Q&A" ] }, { "cell_type": "code", "execution_count": 4, "id": "7c82d7fa", "metadata": {}, "outputs": [], "source": [ "from tqdm.auto import tqdm\n", "from datasets import load_dataset\n", "import os\n", "import pandas as pd\n", "import random" ] }, { "cell_type": "markdown", "id": "1737f219", "metadata": {}, "source": [ "## Distilbert Data\n", "In the following, we download the english openwebtext dataset from huggingface (https://huggingface.co/datasets/openwebtext). The dataset is provided by Aaron Gokaslan and Vanya Cohen from Brown University (https://skylion007.github.io/OpenWebTextCorpus/).\n", "\n", "We first load the data, investigate the structure and write the dataset into files of each 10 000 texts." ] }, { "cell_type": "code", "execution_count": null, "id": "cce7623c", "metadata": {}, "outputs": [], "source": [ "ds = load_dataset(\"openwebtext\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "678a5e86", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text'],\n", " num_rows: 8013769\n", " })\n", "})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we have a text-only training dataset with 8 million entries\n", "ds" ] }, { "cell_type": "code", "execution_count": 5, "id": "b141bce7", "metadata": {}, "outputs": [], "source": [ "# create necessary folders\n", "os.mkdir('data')\n", "os.mkdir('data/original')" ] }, { "cell_type": "code", "execution_count": null, "id": "ca94f995", "metadata": {}, "outputs": [], "source": [ "# save text in chunks of 10000 samples\n", "text = []\n", "i = 0\n", "\n", "for sample in tqdm(ds['train']):\n", " # replace all newlines\n", " sample = sample['text'].replace('\\n','')\n", " \n", " # append cleaned sample to all texts\n", " text.append(sample)\n", " \n", " # if we processed 10000 samples, write them to a file and start over\n", " if len(text) == 10000:\n", " with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n", " f.write('\\n'.join(text))\n", " text = []\n", " i += 1 \n", "\n", "# write remaining samples to a file\n", "with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n", " f.write('\\n'.join(text))" ] }, { "cell_type": "markdown", "id": "f131dcfc", "metadata": {}, "source": [ "### Testing\n", "If we load the first file, we should get a file that is 10000 lines long and has one column\n", "\n", "As we do not preprocess the data in any way, but just write the read text into the file, this is all testing necessary" ] }, { "cell_type": "code", "execution_count": 13, "id": "df50af74", "metadata": {}, "outputs": [], "source": [ "with open(\"data/original/text_0.txt\", 'r', encoding='utf-8') as f:\n", " lines = f.read().split('\\n')\n", "lines = pd.DataFrame(lines)" ] }, { "cell_type": "code", "execution_count": 14, "id": "8ddb0085", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Passed\n" ] } ], "source": [ "assert lines.shape==(10000,1)\n", "print(\"Passed\")" ] }, { "cell_type": "markdown", "id": "1a65b268", "metadata": {}, "source": [ "## SQuAD Data\n", "In the following, we 
{ "cell_type": "markdown", "id": "1a65b268", "metadata": {}, "source": [ "## SQuAD Data\n", "In the following, we download the SQuAD dataset from Hugging Face (https://huggingface.co/datasets/squad). It was originally provided by Rajpurkar et al. from Stanford University.\n", "\n", "We again load the dataset and store it in files in chunks of 1,000 samples." ] },
{ "cell_type": "code", "execution_count": null, "id": "6750ce6e", "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"squad\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "65a7ee23", "metadata": {}, "outputs": [], "source": [ "os.mkdir(\"data/training_squad\")\n", "os.mkdir(\"data/test_squad\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "f6ebf63e", "metadata": {}, "outputs": [], "source": [ "# we already have a training and a validation split. Each sample has an id, title, context, question and answers.\n", "dataset" ] },
{ "cell_type": "code", "execution_count": null, "id": "f67ae448", "metadata": {}, "outputs": [], "source": [ "# answers are provided like this; answer_end still needs to be derived for the model (see the sketch after the saving code below)\n", "dataset['train']['answers'][0]" ] },
{ "cell_type": "code", "execution_count": null, "id": "101cd650", "metadata": {}, "outputs": [], "source": [ "# column contains the split (either train or validation), save_dir is the target directory\n", "def save_samples(column, save_dir):\n", "    text = []\n", "    i = 0\n", "\n", "    for sample in tqdm(dataset[column]):\n", "\n", "        # preprocess the context and question by removing the newlines\n", "        context = sample['context'].replace('\\n', '')\n", "        question = sample['question'].replace('\\n', '')\n", "\n", "        # get the answer as text and start character index\n", "        answer_text = sample['answers']['text'][0]\n", "        answer_start = str(sample['answers']['answer_start'][0])\n", "\n", "        text.append([context, question, answer_text, answer_start])\n", "\n", "        # we choose chunks of 1000\n", "        if len(text) == 1000:\n", "            with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n", "                f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n", "            text = []\n", "            i += 1\n", "\n", "    # save the remaining samples\n", "    with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n", "        f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n", "\n", "save_samples(\"train\", \"training_squad\")\n", "save_samples(\"validation\", \"test_squad\")" ] },
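{ "cell_type": "markdown", "id": "c2d8e7f4", "metadata": {}, "source": [ "The files above only store `answer_start`; `answer_end` is not written out. Below is a minimal sketch of how it can be derived, assuming the SQuAD convention that `answer_start` is a character offset into the context." ] },
{ "cell_type": "code", "execution_count": null, "id": "d5f0a3b8", "metadata": {}, "outputs": [], "source": [ "# minimal sketch (not used by the saving code above): derive answer_end from answer_start,\n", "# assuming answer_start is a character offset into the context, as in SQuAD\n", "sample = dataset['train'][0]\n", "answer_text = sample['answers']['text'][0]\n", "answer_start = sample['answers']['answer_start'][0]\n", "answer_end = answer_start + len(answer_text)\n", "\n", "# the character span [answer_start, answer_end) should reproduce the answer text\n", "assert sample['context'][answer_start:answer_end] == answer_text\n", "answer_start, answer_end, answer_text" ] },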
] }, { "cell_type": "code", "execution_count": null, "id": "446281cf", "metadata": {}, "outputs": [], "source": [ "with open(\"data/training_squad/text_0.txt\", 'r', encoding='utf-8') as f:\n", " lines = f.read().split('\\n')\n", " \n", "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "ccd5c650", "metadata": {}, "outputs": [], "source": [ "assert lines.shape==(1000,4)\n", "print(\"Passed\")" ] }, { "cell_type": "code", "execution_count": null, "id": "2c9e4b70", "metadata": {}, "outputs": [], "source": [ "# we assert that we have the right interval\n", "for ind, line in lines.iterrows():\n", " sample = line\n", " answer_start = int(sample['answer_start'])\n", " assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n", "print(\"Passed\")" ] }, { "cell_type": "markdown", "id": "02265ace", "metadata": {}, "source": [ "## Natural Questions Dataset\n", "* Download from https://ai.google.com/research/NaturalQuestions via gsutil (the one from huggingface has 134.92GB, the one from google cloud is in archives)\n", "* Use gunzip to get some samples - we then get `.jsonl`files\n", "* The dataset is a lot more messy, as it is just wikipedia articles with all web artifacts\n", " * I cleaned the html tags\n", " * Also I chose a random interval (containing the answer) from the dataset\n", " * We can't send the whole text into the model anyways" ] }, { "cell_type": "code", "execution_count": null, "id": "f3bce0c1", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "paths = [str(x) for x in Path('data/natural_questions/v1.0/train/').glob('**/*.jsonl')]" ] }, { "cell_type": "code", "execution_count": null, "id": "e9c58c00", "metadata": {}, "outputs": [], "source": [ "os.mkdir(\"data/natural_questions_train\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0ed7ba6c", "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# clean html tags\n", "CLEANR = re.compile('<.+?>')\n", "# clean multiple spaces\n", "CLEANMULTSPACE = re.compile('(\\s)+')\n", "\n", "# the function takes an html documents and removes artifacts\n", "def cleanhtml(raw_html):\n", " # tags\n", " cleantext = re.sub(CLEANR, '', raw_html)\n", " # newlines\n", " cleantext = cleantext.replace(\"\\n\", '')\n", " # tabs\n", " cleantext = cleantext.replace(\"\\t\", '')\n", " # character encodings\n", " cleantext = cleantext.replace(\"'\", \"'\")\n", " cleantext = cleantext.replace(\"&\", \"'\")\n", " cleantext = cleantext.replace(\""\", '\"')\n", " # multiple spaces\n", " cleantext = re.sub(CLEANMULTSPACE, ' ', cleantext)\n", " # documents end with this tags, if it is present in the string, cut it off\n", " idx = cleantext.find(\"