"**Deep Learning Approaches to Natural Language Processing**\n",
"**Text Compression Assignment**\n",
"**Matthias Bartolo**\n",
"\n",
" \n",
"\n",
"### Package imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Suggested imports. You may add your own here.\n",
"\n",
"%matplotlib inline\n",
"\n",
"import collections\n",
"import random\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"import numpy as np\n",
"import torch\n",
"from collections import Counter\n",
"\n",
"# Variable to control the size of the print statements\n",
"print_size = 87\n",
"\n",
"device = 'cuda' if torch.cuda.is_available() else 'cpu'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text compression assignment\n",
"\n",
"It is said that you can measure the intelligence of an AI from the amount it can compress a text without information loss.\n",
"One way to think about this is that, the more a text is predictable, the more words we can leave out of it as we can guess the missing words.\n",
"On the other hand, the more intelligent an AI is, the more it will find texts to be predictable and so the more words it can leave out and guess.\n",
"This has led to a competition called the [Hutter Prize](http://prize.hutter1.net/) where the objective is to compress a given text as much as possible.\n",
"The record for compressing a 1GB text file extracted from a Wikipedia snapshot is about 115MB.\n",
"The main hurdle here is that the program used to decompress the file must be treated as part of the compressed file, meaning that the program itself must also be small.\n",
"\n",
"In this assignment, you're going to be doing something similar using a smaller text file and using neural language models to guess missing words."
]
},
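{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, below is a minimal sketch of prediction-based lossless compression; it is an illustration only, not the approach required by this assignment. Every token that a deterministic predictor would guess correctly is replaced by a placeholder, and decompression re-runs the same predictor to fill the gaps back in. The `fit_predictor`, `toy_compress`, and `toy_decompress` names, the `_` placeholder, and the always-predict-the-most-frequent-token baseline are all illustrative assumptions; a neural language model would simply make far more tokens predictable, allowing far more of them to be dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch (illustration only): prediction-based lossless compression.\n",
"# Assumption: the placeholder token '_' never occurs in the text itself.\n",
"from collections import Counter\n",
"\n",
"def fit_predictor(tokens):\n",
"    # Toy \"model\": always predict the single most frequent training token.\n",
"    return Counter(tokens).most_common(1)[0][0]\n",
"\n",
"def toy_compress(tokens, guess, marker='_'):\n",
"    # Drop every token that the predictor would have guessed correctly.\n",
"    return [marker if token == guess else token for token in tokens]\n",
"\n",
"def toy_decompress(compressed, guess, marker='_'):\n",
"    # Re-run the same predictor to restore the dropped tokens.\n",
"    return [guess if token == marker else token for token in compressed]\n",
"\n",
"toy_tokens = 'the cat sat on the mat and the dog sat on the rug'.split()\n",
"toy_guess = fit_predictor(toy_tokens)\n",
"\n",
"# The round trip must reproduce the original tokens exactly (lossless).\n",
"assert toy_decompress(toy_compress(toy_tokens, toy_guess), toy_guess) == toy_tokens"
]
},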
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Data processing (10%)\n",
"\n",
"You have a train/dev/test split corpus of text from Wikipedia consisting of single sentences.\n",
"Each sentence is on a separate line and each sentence has been tokenised for you such that tokens are space separated.\n",
"This means that you only need to split by space to get the tokens.\n",
"The text has all been lowercased as well.\n",
"The objective here is to be able to compress the text losslessly, meaning that it can be decompressed back to the original string:\n",
"\n",
"$$\\text{decompress}(\\text{compress}(t)) = t$$\n",
"\n",
"Do not do any further pre-processing on the text (such as stemming) as it may result in unrecoverable information loss.\n",
"The test set is what we will be compressing and will not be processed at all as it will be treated as a single big string by the compression/decompression algorithms.\n",
"\n",
"Do the following tasks:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1.1) Load the train set and dev set text files into a list of sentences where each sentence is tokenised (by splitting by space).\n",
"Do not load the test set."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def load_into_sentences(filename):\n",
" \"\"\"Loads the file with the given filename into a list of sentences.\n",
" \n",
" Args:\n",
" filename (str): The filename to load.\n",
" \n",
" Returns:\n",
" list: A list of sentences, where each sentence is a list of words.\n",
" \"\"\"\n",
" # Opening file and reading lines. Stripping each line of whitespace and newlines.\n",
" with open(filename, encoding='utf-8') as f:\n",
" sentences = f.readlines()\n",
"\n",
" # Splitting the line by spaces and stripping each word of whitespace and newlines.\n",
" sentences = [sentence.split() for sentence in sentences]\n",
"\n",
" # Returning the list of sentences.\n",
" return sentences"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Loading the data into lists of sentences.\n",
"dev_sentences = load_into_sentences('dev.txt')\n",
"train_sentences = load_into_sentences('train.txt')"
]
},
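{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, purely illustrative sanity check (not part of the assignment tasks), the cell below prints how many sentences were loaded from each split and the start of the first dev sentence, truncated to `print_size` characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check on the loaded data (assumes train.txt and dev.txt were found by the cell above).\n",
"print('Number of training sentences:', len(train_sentences))\n",
"print('Number of dev sentences:', len(dev_sentences))\n",
"print('First dev sentence:', ' '.join(dev_sentences[0])[:print_size])"
]
},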
{
"cell_type": "markdown",
"metadata": {},
"source": [
"