# Get Data
Wikipedia publishes its data as an XML dump. This notebook shows a relatively simple way to convert that dump into a single JSON file for our purposes.

## Initialize Variables

In [1]:
from pathlib import Path
import sys

In [16]:
proj_dir_path = Path.cwd().parent
proj_dir = str(proj_dir_path)

# Add the project root to sys.path so the `src` package can be imported later
sys.path.append(proj_dir)

## Install Libraries

In [3]:
%pip install -q -r "$proj_dir"/requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Download Latest Simple Wikipedia

I'm downloading the "latest" dump, but it's still useful to check which version that is.

In [4]:
!curl -I https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2 --silent | grep "Last-Modified"

Last-Modified: Sun, 01 Oct 2023 23:32:27 GMT
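
If you want to record the dump version programmatically, the header line can be parsed into a date. The following is a minimal sketch using only the standard library; `parse_last_modified` is a hypothetical helper name, not part of this project.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_last_modified(header_line: str) -> datetime:
    """Parse a 'Last-Modified: <HTTP date>' header line into an aware datetime."""
    # Everything after the first colon is the HTTP date string.
    _, _, value = header_line.partition(":")
    return parsedate_to_datetime(value.strip())


# The header line printed by the curl command above.
modified = parse_last_modified("Last-Modified: Sun, 01 Oct 2023 23:32:27 GMT")
print(modified.date().isoformat())  # 2023-10-01
```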


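Download the Simple English Wikipedia dump. The `-nc` flag skips the download if the file already exists, and `-P` sets the target directory.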

In [5]:
!wget -nc -P "$proj_dir"/data/raw https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2

--2023-10-18 10:55:38--  https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 286759308 (273M) [application/octet-stream]
Saving to: ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’


2023-10-18 10:56:45 (4.13 MB/s) - ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’ saved [286759308/286759308]



## Extract from XML
The Wikipedia dump is distributed as XML. `wikiextractor` converts it into JSON Lines (one JSON object per article, one article per line), split across many folders and files.

In [9]:
!wikiextractor -o "$proj_dir"/data/raw/output --json "$proj_dir"/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2

INFO: Preprocessing '/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Loaded 36594 templates in 54.1s
INFO: Starting page extraction from /home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2.
INFO: Using 3 extract processes.
INFO: Extracted 100000 articles (3481.4 art/s)
INFO: Extracted 200000 articles (3764.9 art/s)
INFO: Extracted 300000 articles (4175.8 art/s)
INFO: Finished 3-process extraction of 332024 articles in 86.9s (3822.7 art/s)
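
To get a feel for the extracted data, it can help to read a few records back. The sketch below assumes the usual `wikiextractor --json` layout: subfolders such as `AA` containing files named `wiki_00`, `wiki_01`, and so on, with one JSON object per line carrying fields such as `title` and `text`. `iter_articles` is a hypothetical helper, not something this project provides.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(folder: Path) -> Iterator[dict]:
    """Yield one dict per article from every extracted file under `folder`."""
    # Assumed layout: folder/AA/wiki_00, folder/AA/wiki_01, ...
    for path in sorted(folder.glob("*/wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For example, `next(iter_articles(proj_dir_path / 'data/raw/output'))["title"]` should return the title of the first extracted article, provided the assumptions above hold for your output.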


## Consolidate into JSON

The split format is tedious to work with, so we will now consolidate it into a single JSON file. This is fine because our data fits easily in RAM. If it didn't, a streaming approach that never holds the whole dataset in memory would be a better option.

Feel free to check out the [consolidate file](../src/preprocessing/consolidate.py) for more details.
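
For the case where the data does not fit in RAM, one option is to stream records into a single JSON Lines file instead of building everything in memory. This is only a sketch of that alternative, not what `consolidate.py` does; `folder_to_jsonl` is a hypothetical name, and the `*/wiki_*` file layout is an assumption about the `wikiextractor` output.

```python
import json
from pathlib import Path


def folder_to_jsonl(folder: Path, file_out: Path) -> int:
    """Stream every record under `folder` into one JSON Lines file.

    Returns the number of records written. Only one line is held in
    memory at a time.
    """
    file_out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with file_out.open("w", encoding="utf-8") as out:
        # Assumed layout: folder/AA/wiki_00, folder/AA/wiki_01, ...
        for path in sorted(folder.glob("*/wiki_*")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)  # fail early on malformed input
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

The trade-off is that downstream code then has to read the result line by line rather than with a single `json.load` call.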

In [14]:
from src.preprocessing.consolidate import folder_to_json

In [17]:
folder = proj_dir_path / 'data/raw/output'
file_out = proj_dir_path / 'data/consolidated/simple_wiki.json'
folder_to_json(folder, file_out)

Processing:   0%|          | 0/206 [00:00<?, ?file/s]

Wiki processed in 2.92 seconds!
Writing file!
File written in 3.08 seconds!
