### Setting Up the data

In [1]:
import os
source_language = "fr"
target_language = "swc"  # swc is the language code of Congo Swahili
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline_v2" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# No need to use gdrive since we are training on gcp
!mkdir -p "$src-$tgt-$tag"
os.environ["gdrive_path"] = "%s-%s-%s" % (source_language, target_language, tag) # saving directly on the vm

In [2]:
!echo $gdrive_path

fr-swc-baseline_v2


#### Downloading the corpus data

Remove the old data and download it again for verification purposes.

In [3]:
!rm -f jw300.$src jw300.$tgt JW300_latest_xml_$src-$tgt.xml.gz JW300_latest_xml_$src-$tgt.xml

In [4]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/fr-swc.xml.gz not found. The following files are available for downloading:

   5 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/fr-swc.xml.gz
 278 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/fr.zip
  54 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/swc.zip

 338 MB Total size
./JW300_latest_xml_fr-swc.xml.gz ... 100% of 5 MB
./JW300_latest_xml_fr.zip ... 100% of 278 MB
./JW300_latest_xml_swc.zip ... 100% of 54 MB


In [5]:
#! wget https://raw.githubusercontent.com/espoirMur/masakhane/add-french-global-test-set/jw300_utils/test/test.fr-any.fr

os.environ["trg"] = target_language 
os.environ["src"] = source_language 

#! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.en 
#! mv test.en-$trg.en test.en
#! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.$trg 
#! mv test.en-$trg.$trg test.$trg

Read the French test data generated in [this notebook](./buiding_french_global_test_set.ipynb) and save it in a set for quick retrieval.

### Generating test French Swahili Congo dataset

First, we download the global English / Congo Swahili test set.

In [6]:
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-swc.en

--2020-02-13 08:27:54--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-swc.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205351 (201K) [text/plain]
Saving to: ‘test.en-swc.en’


2020-02-13 08:27:55 (4.81 MB/s) - ‘test.en-swc.en’ saved [205351/205351]



In [10]:
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-swc.swc

--2020-02-13 08:34:07--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-swc.swc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231897 (226K) [text/plain]
Saving to: ‘test.en-swc.swc’


2020-02-13 08:34:07 (5.55 MB/s) - ‘test.en-swc.swc’ saved [231897/231897]



Then we check whether the data is aligned.

In [3]:
!head -10 jw300.fr

Sommaire
8 janvier 2000
Médecine et chirurgie sans transfusion : une discipline en plein essor
De plus en plus de patients optent pour la chirurgie sans transfusion .
Pourquoi , et quels sont les résultats ?
3 Pionniers de la médecine
4 Transfusion sanguine : une histoire riche en revirements
7 Un intérêt croissant pour la médecine et la chirurgie sans transfusion
12 Voulez - ​ vous apprendre une langue étrangère ?
14 Le café : mauvais pour le cholestérol ?


In [4]:
!head -10 jw300.swc

Ukurasa wa pili
Mwezi wa 8 , 2000
Tiba na Upasuaji Bila Damu Uhitaji Unaoongezeka 3 - 11
Sasa tiba na upasuaji bila damu ni wa kawaida zaidi kuliko wakati mwingine wowote .
Kwa nini unahitajiwa sana namna hiyo ?



Je , ni njia badala iliyo salama ya utiaji - damu mishipani ?



In [5]:
source_file = "jw300.swc"
target_file = "jw300.fr"
test_file = "test.en-swc.swc"
with open(test_file) as swc_test_file, open(target_file) as fr_full_file, open(source_file) as swc_full_file:
    swc_test_sentences = swc_test_file.readlines()
    fr_full_sentences = fr_full_file.readlines()
    swc_full_sentences = swc_full_file.readlines()

In [6]:
swc_test_sentences = [sentence.strip() for sentence in swc_test_sentences]
fr_full_sentences = [sentence.strip() for sentence in fr_full_sentences]
swc_full_sentences = [sentence.strip() for sentence in swc_full_sentences]

In [7]:
swc_full_sentences[:10]

['Ukurasa wa pili',
 'Mwezi wa 8 , 2000',
 'Tiba na Upasuaji Bila Damu Uhitaji Unaoongezeka 3 - 11',
 'Sasa tiba na upasuaji bila damu ni wa kawaida zaidi kuliko wakati mwingine wowote .',
 'Kwa nini unahitajiwa sana namna hiyo ?',
 '',
 '',
 '',
 'Je , ni njia badala iliyo salama ya utiaji - damu mishipani ?',
 '']

In [8]:
fr_full_sentences[:10]

['Sommaire',
 '8 janvier 2000',
 'Médecine et chirurgie sans transfusion : une discipline en plein essor',
 'De plus en plus de patients optent pour la chirurgie sans transfusion .',
 'Pourquoi , et quels sont les résultats ?',
 '3 Pionniers de la médecine',
 '4 Transfusion sanguine : une histoire riche en revirements',
 '7 Un intérêt croissant pour la médecine et la chirurgie sans transfusion',
 '12 Voulez - \u200b vous apprendre une langue étrangère ?',
 '14 Le café : mauvais pour le cholestérol ?']

For every Swahili sentence in the corpus that also appears in the global Swahili test set, get the French equivalent.

In [9]:
matching_fr_test_sentences = []
matching_swc_test_sentences = []

for index, swahili_line in enumerate(swc_full_sentences):
    if swahili_line in swc_test_sentences and swahili_line:
        matching_fr_test_sentences.append(fr_full_sentences[index])
        matching_swc_test_sentences.append(swahili_line)
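
The loop above tests membership in a list, which scans the whole test set for every corpus line. A minimal sketch of the same matching with a set lookup is shown below; the toy sentences and the names `swc_full`, `fr_full`, `swc_test`, and `matching_pairs` are illustrative assumptions, not part of the notebook.

```python
# Toy stand-ins for the corpus and test files (hypothetical data).
swc_full = ["Ukurasa wa pili", "", "Kwa nini ?", "Ukurasa wa pili"]
fr_full = ["Sommaire", "3 Pionniers", "Pourquoi ?", "Sommaire"]
swc_test = ["Kwa nini ?", "Ukurasa wa pili"]

# A set gives constant-time membership checks on average,
# instead of a linear scan of the list for every corpus line.
swc_test_set = set(swc_test)

# Keep the (French, Swahili) pair whenever the non-empty Swahili line
# appears in the test set; the two files are aligned line by line.
matching_pairs = [
    (fr, swc)
    for fr, swc in zip(fr_full, swc_full)
    if swc and swc in swc_test_set
]
print(matching_pairs)
```

As in the original loop, repeated corpus lines produce repeated pairs, so duplicates still have to be removed afterwards.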

In [10]:
import pandas as pd

In [11]:
french_swc_test_dataset = pd.DataFrame(zip(matching_fr_test_sentences, matching_swc_test_sentences), columns=['french_sentence', 'swahili_congo_sentence'])

In [12]:
french_swc_test_dataset.head(10)

Unnamed: 0,french_sentence,swahili_congo_sentence
0,"Par souci d’anonymat , certains noms ont été c...",Baadhi ya majina katika makala haya yamebadili...
1,"Publié par les Témoins de Jéhovah , mais épuisé .",Kilichapishwa na Mashahidi wa Yehova lakini ha...
2,Vous ne pouvez travailler comme des esclaves p...,Hamwezi kutumikia kama watumwa Mungu na Utajir...
3,Vous ne pouvez travailler comme des esclaves p...,Hamwezi kutumikia kama watumwa Mungu na Utajir...
4,Il n’appartient pas à l’homme qui marche de di...,"Ama kwa hakika , mwanadamu anahitaji msaada wa..."
5,“ Le monde entier se trouve au pouvoir du méch...,“ Ulimwengu mzima unakaa katika nguvu za yule ...
6,"Ne regarde pas tout autour , car je suis ton D...","Usitazame huku na huku , kwa maana mimi ni Mun..."
7,"Ne regarde pas tout autour , car je suis ton D...","Usitazame huku na huku , kwa maana mimi ni Mun..."
8,“ Un véritable compagnon aime tout le temps et...,"“ Rafiki wa kweli anapenda nyakati zote , naye..."
9,"En outre , n’appelez personne votre père sur l...","Zaidi ya hayo , msimwite mtu yeyote baba yenu ..."


In [13]:
french_swc_test_dataset.tail(10)

Unnamed: 0,french_sentence,swahili_congo_sentence
2840,Et qu’est - ​ ce qui est plus joli : le bruit ...,Na namna gani unaweza kulinganisha mulio wa av...
2841,"Alors , qui est le plus intelligent : le créat...",Ni nani mwenye kuwa na akili sana ; mutu mweny...
2842,Un père conseille : « Ne vous fatiguez jamais ...,Baba mumoja alisema hivi : “ Usichoke hata kid...
2843,"Quand ils étaient tout petits , je faisais 15 ...","Tangu wakati walikuwa wadogo sana , nilijifunz..."
2844,"Avec le temps , j’ai obtenu beaucoup de répons...","Kisha wakati fulani , nilipata majibu mengi ya..."
2845,C’est pour cela que c’est important que les pa...,Ndiyo sababu ni jambo la maana wazazi wasiache...
2846,Montre - ​ leur que Jéhovah est vraiment réel ...,Acha watoto wako watambue kama unaona Yehova k...
2847,« Nous disons à la plus grande : “ Fais totale...,Wazazi hao wanasema hivi : “ Tuliambia pia mut...
2848,Lorsqu’elle voit comment les choses s’arrangen...,"Wakati anaona matokeo , anaelewa kama Yehova a..."
2849,C’est excellent pour sa foi en Dieu et en la B...,Hilo limemusaidia sana akuwe na imani yenye ng...


Remove duplicates from both sides

In [14]:
french_swc_test_dataset = french_swc_test_dataset.drop_duplicates(subset='french_sentence')
french_swc_test_dataset = french_swc_test_dataset.drop_duplicates(subset='swahili_congo_sentence')
french_swc_test_dataset.shape

(2478, 2)
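
To make the effect of the two `drop_duplicates` calls concrete, here is a small plain-Python sketch of the same idea: keep the first pair for each French sentence, then the first pair for each Swahili sentence among the survivors. The helper `drop_duplicates_by` and the toy pairs are hypothetical.

```python
def drop_duplicates_by(pairs, key_index):
    """Keep only the first pair seen for each value at position key_index."""
    seen = set()
    kept = []
    for pair in pairs:
        key = pair[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


# Toy (french, swahili) pairs with duplicates on both sides.
pairs = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "z")]

# First deduplicate on the French side (index 0), then on the Swahili side (index 1).
deduped = drop_duplicates_by(drop_duplicates_by(pairs, 0), 1)
print(deduped)
```

Because the second pass only sees what survived the first, the order of the two passes can change which pairs remain.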

In [15]:
fr_test_sents = set()
filter_test_sents = "test.fr-any.fr"
j = 0
with open(filter_test_sents) as f:
    for line in f:
        fr_test_sents.add(line.strip())
        j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3332 global test sentences to filter from the training/dev data.


In [34]:
french_swc_test_dataset.loc[~french_swc_test_dataset.french_sentence.isin(fr_test_sents)].shape

(643, 2)

In [19]:
french_swc_test_dataset.loc[~french_swc_test_dataset.french_sentence.isin(fr_test_sents)].head()

Unnamed: 0,french_sentence,swahili_congo_sentence
4,Il n’appartient pas à l’homme qui marche de di...,"Ama kwa hakika , mwanadamu anahitaji msaada wa..."
5,“ Le monde entier se trouve au pouvoir du méch...,“ Ulimwengu mzima unakaa katika nguvu za yule ...
8,“ Un véritable compagnon aime tout le temps et...,"“ Rafiki wa kweli anapenda nyakati zote , naye..."
9,"En outre , n’appelez personne votre père sur l...","Zaidi ya hayo , msimwite mtu yeyote baba yenu ..."
12,"“ Les justes posséderont la terre , et sur ell...","“ Waadilifu wenyewe wataimiliki dunia , nao wa..."


Only 643 sentences are in the French-Swahili Congo test set but not in the global test set!

Append those sentences to the global test set.

In [26]:
french_swc_test_dataset.loc[~french_swc_test_dataset.french_sentence.isin(fr_test_sents)].french_sentence.to_csv('test.fr-any.fr', mode='a', header=False, index=False)

In [29]:
french_swc_test_dataset.loc[~french_swc_test_dataset.french_sentence.isin(fr_test_sents)].french_sentence.tail(10)

2837    Elle raconte : « Ils l’ont fait avec beaucoup ...
2838    Quand je leur ai demandé pourquoi ils avaient ...
2842    Un père conseille : « Ne vous fatiguez jamais ...
2843    Quand ils étaient tout petits , je faisais 15 ...
2844    Avec le temps , j’ai obtenu beaucoup de répons...
2845    C’est pour cela que c’est important que les pa...
2846    Montre - ​ leur que Jéhovah est vraiment réel ...
2847    « Nous disons à la plus grande : “ Fais totale...
2848    Lorsqu’elle voit comment les choses s’arrangen...
2849    C’est excellent pour sa foi en Dieu et en la B...
Name: french_sentence, dtype: object

In [27]:
!tail -10 test.fr-any.fr

Elle raconte : « Ils l’ont fait avec beaucoup de soin .
"Quand je leur ai demandé pourquoi ils avaient fait autant attention , ils ont répondu qu’ils voulaient que le café soit exactement comme je l’aime ."
Un père conseille : « Ne vous fatiguez jamais d’essayer de nouvelles méthodes pour reparler de vieux sujets .
"Quand ils étaient tout petits , je faisais 15 minutes d’étude avec eux tous les jours , sauf les jours de réunion ."
"Avec le temps , j’ai obtenu beaucoup de réponses aux réunions , ou durant l’étude familiale ou individuelle ."
C’est pour cela que c’est important que les parents n’arrêtent jamais d’enseigner .
Montre - ​ leur que Jéhovah est vraiment réel pour toi .
"« Nous disons à la plus grande : “ Fais totalement confiance à Jéhovah , reste active dans le service du Royaume et ne t’inquiète pas trop . ”"
"Lorsqu’elle voit comment les choses s’arrangent , elle se rend compte que Jéhovah nous aide ."
C’est excellent pour sa foi en Dieu et en la Bible . »
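
The `tail` output above shows that `to_csv` wraps sentences containing commas in double quotes, so those lines may no longer match the raw corpus lines exactly. One possible alternative, sketched here under the assumption that plain one-sentence-per-line text is wanted, is to append the lines directly; `append_lines` and the temporary demo file are hypothetical.

```python
import os
import tempfile


def append_lines(path, sentences):
    """Append one sentence per line, without CSV quoting or escaping."""
    with open(path, "a", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")


# Demo on a temporary file rather than the real test.fr-any.fr.
demo_path = os.path.join(tempfile.mkdtemp(), "test.demo.fr")
append_lines(demo_path, ["Avec le temps , j’ai obtenu des réponses .", "Sommaire"])

with open(demo_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```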


Save the test set to new files.

In [95]:
french_swc_test_dataset.swahili_congo_sentence.to_csv("test.swc",sep='\n', header=False, index=False)
french_swc_test_dataset.french_sentence.to_csv("test.fr",sep='\n', header=False, index=False)

In [None]:
!ls

#### Building the model dataset

In [96]:
!head -5 test.$src

Par souci d’anonymat , certains noms ont été changés .
Publié par les Témoins de Jéhovah , mais épuisé .
Vous ne pouvez travailler comme des esclaves pour Dieu et pour la Richesse .
Il n’appartient pas à l’homme qui marche de diriger son pas .
“ Le monde entier se trouve au pouvoir du méchant .


In [97]:
!head -5 test.$tgt

Baadhi ya majina katika makala haya yamebadilishwa .
Kilichapishwa na Mashahidi wa Yehova lakini hakichapwi tena .
Hamwezi kutumikia kama watumwa Mungu na Utajiri . ”
Ama kwa hakika , mwanadamu anahitaji msaada wa Mungu .
“ Ulimwengu mzima unakaa katika nguvu za yule mwovu . ”


In [59]:
test_lines_to_ignore = fr_test_sents.union(set(french_swc_test_dataset.french_sentence.unique()))

In [60]:
# Moses-format text files (one sentence per line) to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language

source = []
target = []
skip_lines = []  # Collect the line numbers of the source portion to skip the same lines for the target portion.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set or in the frc_test_set
        if line.strip() not in test_lines_to_ignore:
            source.append(line.strip())
        else:
            skip_lines.append(i)             
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
# If you get "TypeError: data argument can't be an iterator" (seen with some pandas versions), run the line below instead.
#df = pd.DataFrame(list(zip(source, target)), columns=['source_sentence', 'target_sentence'])
df.head(10)

Loaded data and skipped 13705/576109 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,8 janvier 2000,"Mwezi wa 8 , 2000"
1,Médecine et chirurgie sans transfusion : une d...,Tiba na Upasuaji Bila Damu Uhitaji Unaoongezek...
2,De plus en plus de patients optent pour la chi...,Sasa tiba na upasuaji bila damu ni wa kawaida ...
3,"Pourquoi , et quels sont les résultats ?",Kwa nini unahitajiwa sana namna hiyo ?
4,3 Pionniers de la médecine,
5,4 Transfusion sanguine : une histoire riche en...,
6,7 Un intérêt croissant pour la médecine et la ...,
7,12 Voulez - ​ vous apprendre une langue étrang...,"Je , ni njia badala iliyo salama ya utiaji - d..."
8,14 Le café : mauvais pour le cholestérol ?,
9,20 Le dilemme des mères atteintes du sida,
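
The filtering cell above keeps the skipped line numbers in a list, so every target line is checked against it with a linear scan. A minimal sketch of the same two-pass filter with a set of indices follows; the function `filter_parallel` and the toy lines are assumptions for illustration only.

```python
def filter_parallel(source_lines, target_lines, lines_to_ignore):
    """Drop source lines found in lines_to_ignore and the aligned target lines."""
    source, target, skipped = [], [], set()
    for i, line in enumerate(source_lines):
        if line.strip() in lines_to_ignore:
            skipped.add(i)  # remember the index so the target side stays aligned
        else:
            source.append(line.strip())
    for j, line in enumerate(target_lines):
        if j not in skipped:  # set lookup instead of a list scan
            target.append(line.strip())
    return source, target, skipped


src_lines = ["Sommaire\n", "8 janvier 2000\n", "Pourquoi ?\n"]
tgt_lines = ["Ukurasa wa pili\n", "Mwezi wa 8 , 2000\n", "Kwa nini ?\n"]
src_kept, tgt_kept, skipped = filter_parallel(src_lines, tgt_lines, {"Sommaire"})
print(len(skipped), "of", len(src_lines), "lines skipped")
```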


Some sanity checks:

In [61]:
df.tail(10)

Unnamed: 0,source_sentence,target_sentence
562395,"Comme les chrétiens hébreux , nous pouvons étu...","Kama vile Wakristo Waebrania , tunaweza kusoma..."
562396,"Pour montrer que cette promesse est biblique ,...",Ili kukazia kama ahadi hiyo inategemea Maandik...
562397,Nous sommes touchés de savoir que « la promess...,Inagusa sana mioyo yetu kujua kama ‘ ahadi ime...
562398,Nous sommes convaincus que c’est possible d’en...,Tunatumaini kama inawezekana kabisa kuingia ka...
562399,"Pas en obéissant à la Loi de Moïse , ni en fai...",Hatufanye hivyo kwa kujaribu kufuata Sheria ya...
562400,Mais c’est plutôt en travaillant avec foi et d...,"Lakini , kwa sababu ya imani yetu kwa Mungu , ..."
562401,"De plus , des milliers de personnes dans le mo...",Mamilioni ya watu katika dunia yote wameanza p...
562402,Cette étude a motivé beaucoup d’entre elles à ...,Hilo limechochea wengi kati yao wafanye mabadi...
562403,L’effet que « la parole de Dieu » a sur ces pe...,Namna wanaendelea kufanya mabadiliko inaonyesh...
562404,Les déclarations de Jéhovah sur son projet qui...,Mambo yenye Mungu amefunua katika Biblia juu y...


In [62]:
df.source_sentence[100]

'Pour toutes ces raisons , la médecine sans transfusion a suscité , ces dernières années , un intérêt croissant .'

In [63]:
df.target_sentence[100]

'Basi yaeleweka ni kwa nini katika miaka ya karibuni watu wengi wamependezwa zaidi na tiba na upasuaji bila damu .'

### Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, these public corpora include some number of these that need to be cleaned.

In addition we will split our data into dev/test/train and export to the filesystem.

In [64]:
df_without_duplicates = df.drop_duplicates()

In [65]:
df_without_duplicates = df_without_duplicates.drop_duplicates(subset='source_sentence', inplace=False)
df_without_duplicates = df_without_duplicates.drop_duplicates(subset='target_sentence', inplace=False)

Remove all sentences that can be found in the Swahili test set.

In [72]:
df_without_duplicates = df_without_duplicates.loc[~df_without_duplicates.target_sentence.isin(french_swc_test_dataset.swahili_congo_sentence)]

In [73]:
df_without_duplicates.shape

(503504, 2)

The following assertion checks that no target sentence in the main dataset is missing (NaN).

In [85]:
assert  df_without_duplicates.loc[df_without_duplicates.target_sentence.isna()].shape[0] == 0

We can also check that the filtering was done correctly, i.e. that no test sentence remains in the main dataset.

In [80]:
assert df_without_duplicates.loc[df_without_duplicates.source_sentence.isin(fr_test_sents)].shape[0] == 0

In [81]:
assert df_without_duplicates.loc[df_without_duplicates.source_sentence.isin(french_swc_test_dataset.french_sentence)].shape[0] == 0

In [82]:
assert df_without_duplicates.loc[df_without_duplicates.target_sentence.isin(french_swc_test_dataset.swahili_congo_sentence)].shape[0] == 0

If any of these assertions fails, please check the preprocessing steps above.
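
When one of these checks fails, it can help to see which sentences overlap rather than only that some do. The helper below is a hypothetical sketch on plain lists (`find_overlap` is not part of the notebook); the same idea could be applied to the dataframe columns.

```python
def find_overlap(corpus_sentences, test_sentences, limit=5):
    """Return up to `limit` sentences that occur in both collections, sorted."""
    overlap = set(corpus_sentences) & set(test_sentences)
    return sorted(overlap)[:limit]


corpus = ["Sommaire", "Pourquoi ?", "8 janvier 2000"]
test_set = ["Pourquoi ?", "Sommaire", "Autre phrase"]

leaked = find_overlap(corpus, test_set)
clean = find_overlap(["8 janvier 2000"], test_set)
print("leaked sentences:", leaked)
```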

In [83]:
df_without_duplicates = df_without_duplicates.sample(frac=1, random_state=seed).reset_index(drop=True)

In [84]:
df_without_duplicates.head(10)

Unnamed: 0,source_sentence,target_sentence
0,"Comme l’endroit vous est inconnu , vous êtes q...","Kwa kuwa uko ugenini , unaanza kuwa na wasiwasi ."
1,"Enfin , Paul se montrait remarquablement inven...",Paulo alionyesha ubunifu wa hali ya juu alipos...
2,"Touchés par ces pensées , les membres de cette...",Kwa kuwa waliheshimu Neno la Mungu na kufurahi...
3,Ils ne disposent pas non plus des milliers de ...,Na hawana maelfu ya dola zinazohitajika ili wa...
4,"Parfois , on brisait les os pour recueillir la...",Nyakati nyingine mifupa huvunjwa ili uboho uto...
5,Ils partagent les sentiments de frère Alexande...,"H . Macmillan , ambaye baada ya karibu miaka 6..."
6,La Bible est le livre le plus fiable sur ce su...,Biblia ndicho kitabu kinachotegemeka zaidi kuh...
7,D’autres ont accepté d’étudier la Bible avec n...,Wengine wamekubali tujifunze nao Biblia lakini...
8,” Il est important de noter qu’un seul symptôm...,Ni muhimu kung’amua kwamba dalili moja tu hait...
9,Jésus pensait - ​ il à cela ?,"Je , Yesu alikuwa anarejelea lango hilo ?"


In [86]:
df_without_duplicates.shape

(503504, 2)

In [87]:
import time
from fuzzywuzzy import process
import numpy as np
from os import cpu_count
from functools import partial
from multiprocessing import Pool

In [88]:
## reset the index after
df_without_duplicates.reset_index(inplace=True, drop=False)

In [89]:
# Filtering function. Adjust pad to narrow down the candidate matches to
# within a certain length of characters of the given sample.
def fuzzfilter(sample, candidates, pad):
    candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad] 
    if len(candidates) > 0:
        return process.extractOne(sample, candidates)[1]
    else:
        return np.nan
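
To illustrate the logic of `fuzzfilter` without the `fuzzywuzzy` dependency, here is a sketch that swaps in the standard library's `difflib.SequenceMatcher` as the similarity measure. This is a stand-in, not the scorer used above, so the scores will differ from those of `process.extractOne`; only the length window and the "best score or NaN" behaviour are the same.

```python
import difflib
import math


def fuzzfilter_sketch(sample, candidates, pad):
    """Best similarity (0-100) among candidates within `pad` characters of the sample length."""
    # Same length window as fuzzfilter: len(sample) - pad <= len(c) <= len(sample) + pad.
    window = [c for c in candidates if abs(len(c) - len(sample)) <= pad]
    if not window:
        return math.nan  # no candidate of comparable length
    return max(
        100 * difflib.SequenceMatcher(None, sample, c).ratio() for c in window
    )


score_same = fuzzfilter_sketch("Sommaire", ["Sommaire", "Pourquoi ?"], pad=5)
score_none = fuzzfilter_sketch("Sommaire", ["A much longer candidate sentence"], pad=5)
print(score_same, score_none)
```

The length window is what keeps the search affordable: each sample is only compared against test sentences of similar length.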

In [90]:
start_time = time.time()
# Iterating over pandas DataFrame rows is not recommended; use multiprocessing to apply the function instead.

with Pool(cpu_count()-1) as pool:
    scores = pool.map(partial(fuzzfilter, candidates=list(fr_test_sents), pad=5), df_without_duplicates['source_sentence'])
hours, rem = divmod(time.time() - start_time, 3600)
minutes, seconds = divmod(rem, 60)
print("done in {}h:{}min:{}seconds".format(hours, minutes, seconds))

# Filter out "almost overlapping samples"
df_without_duplicates = df_without_duplicates.assign(scores=scores)



done in 0.0h:43.0min:55.992006063461304seconds


In [92]:
df_without_duplicates = df_without_duplicates[df_without_duplicates['scores'] < 95]
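
Note that `fuzzfilter` returns NaN when no candidate falls inside the length window, and any comparison with NaN evaluates to false, so the `< 95` filter also drops those rows. The source does not say whether that is intended. The sketch below shows the difference on plain floats; `keep_sample` is a hypothetical helper, and in pandas the NaN-aware variant could be written as `scores.isna() | (scores < 95)`.

```python
import math


def keep_sample(score, threshold=95):
    """Keep a sample if it had no candidates (NaN) or scored below the threshold."""
    return math.isnan(score) or score < threshold


scores = [56.0, 97.0, math.nan]

# Plain comparison: NaN < 95 is False, so the NaN row is dropped.
plain_comparison = [s < 95 for s in scores]

# NaN-aware comparison: rows without any candidate are kept.
nan_aware = [keep_sample(s) for s in scores]
print(plain_comparison, nan_aware)
```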

In [93]:
df_without_duplicates.to_csv('data_process_without_duplicates.csv')

In [94]:
df_without_duplicates.head()

Unnamed: 0,index,source_sentence,target_sentence,scores
0,0,"Comme l’endroit vous est inconnu , vous êtes q...","Kwa kuwa uko ugenini , unaanza kuwa na wasiwasi .",56.0
1,1,"Enfin , Paul se montrait remarquablement inven...",Paulo alionyesha ubunifu wa hali ya juu alipos...,51.0
2,2,"Touchés par ces pensées , les membres de cette...",Kwa kuwa waliheshimu Neno la Mungu na kufurahi...,53.0
3,3,Ils ne disposent pas non plus des milliers de ...,Na hawana maelfu ya dola zinazohitajika ili wa...,54.0
4,4,"Parfois , on brisait les os pour recueillir la...",Nyakati nyingine mifupa huvunjwa ili uboho uto...,55.0


Splitting the data into train/dev sets (the test set was already generated above).

In [98]:

# We use 1000 dev sentences and the given test set.
import csv

# Do the split between dev/train and create parallel corpora
num_dev_patterns = 1000

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # Julia: making lowercasing optional
    df_without_duplicates["source_sentence"] = df_without_duplicates["source_sentence"].str.lower()
    df_without_duplicates["target_sentence"] = df_without_duplicates["target_sentence"].str.lower()

# Julia: test sets are already generated
dev = df_without_duplicates.tail(num_dev_patterns) # Herman: Error in original
stripped = df_without_duplicates.drop(df_without_duplicates.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
    for index, row in stripped.iterrows():
        src_file.write(row["source_sentence"]+"\n")
        trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
    for index, row in dev.iterrows():
        src_file.write(row["source_sentence"]+"\n")
        trg_file.write(row["target_sentence"]+"\n")

#stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  # Herman: Added `header=False` everywhere
#stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=False)  # Julia: Problematic handling of quotation marks.

#dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
#dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)

# Doublecheck the format below. There should be no extra quotation marks or weird characters.
! head train.*
! head dev.*

==> train.fr <==
Comme l’endroit vous est inconnu , vous êtes quelque peu angoissé .
Enfin , Paul se montrait remarquablement inventif pour adapter son enseignement à différents auditoires .
Touchés par ces pensées , les membres de cette famille , qui avaient un grand respect pour la Parole de Dieu , ont accepté de l’étudier régulièrement en compagnie des Témoins de Jéhovah .
Ils ne disposent pas non plus des milliers de dollars qu’il faut pour s’en débarrasser correctement . ”
Parfois , on brisait les os pour recueillir la moelle .
Ils partagent les sentiments de frère Alexander Macmillan qui , après avoir servi Dieu avec dévouement pendant plus de 60 ans , a dit : “ Je suis plus déterminé que jamais à persévérer dans ma foi .
La Bible est le livre le plus fiable sur ce sujet ; elle contient de nombreuses prières d’hommes et de femmes fidèles .
D’autres ont accepté d’étudier la Bible avec nous mais ne veulent pas s’engager .
” Il est important de noter qu’un seul symptôme n’est pas le
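
The split itself is simply "last `num_dev_patterns` rows for dev, everything else for train", applied after the shuffle. A minimal sketch on a list of pairs, with a guard for the case where the hold-out size is not smaller than the data, is given below; `split_train_dev` is a hypothetical helper.

```python
def split_train_dev(pairs, num_dev):
    """Hold out the last num_dev pairs as the dev set; the rest is training data."""
    if num_dev <= 0 or num_dev >= len(pairs):
        raise ValueError("num_dev must be between 1 and len(pairs) - 1")
    return pairs[:-num_dev], pairs[-num_dev:]


# Toy shuffled corpus of (source, target) pairs.
pairs = [("fr%d" % i, "swc%d" % i) for i in range(10)]
train, dev = split_train_dev(pairs, num_dev=3)
print(len(train), len(dev))
```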

In [99]:
# Check whether joeynmt is already installed so that the installation can be skipped

In [100]:
!pip3 freeze | grep joeynmt

joeynmt==0.0.1


In [101]:
!pip3 freeze | grep subword-nmt

subword-nmt==0.3.6


In [102]:
!which subword-nmt

/usr/local/bin/subword-nmt


### Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization (Sennrich, 2015).

- It was also shown that optimizing the number of BPE codes significantly improves results for low-resourced languages (Sennrich, 2019) (Martinus, 2019).

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by (Sennrich, 2019). You do not need to change anything. Simply running the cells below will be suitable.
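
For intuition about what the BPE learning step does, here is a toy sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified illustration, not the `subword-nmt` implementation; in particular it omits end-of-word markers, and `learn_bpe` and the sample words are assumptions.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["anaona", "anaelewa", "anaendelea"], num_merges=2)
print(merges)
```

With 4000 merges on the real training data, frequent prefixes and stems end up as single tokens while rare words are split into smaller pieces.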

In [10]:
from os import path

Create the new data folder in the working directory. Since JoeyNMT is already installed, we use a separate data folder.

In [104]:
!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [11]:
# Learn BPEs on the training data.
os.environ["data_path"] = path.join("data", source_language + target_language) # Herma

In [106]:
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt


In [107]:
# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

Create the data folder for this language pair and copy everything to it.

In [108]:
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

bpe.codes.4000	test.bpe.fr	  test.en-any.en.2  test.fr-swc.fr   train.fr
dev.bpe.fr	test.bpe.swc	  test.en-swc.en    test.fr-swc.swc  train.swc
dev.bpe.swc	test.en		  test.en-swc.swc   test.swc	     vocab.txt
dev.fr		test.en-any.en	  test.fr	    train.bpe.fr
dev.swc		test.en-any.en.1  test.fr-any.fr    train.bpe.swc


Creating the vocabulary using JoeyNMT

Download the script manually, check it, and run it.

In [None]:
!wget https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/build_vocab.py

In [109]:
!python3 build_vocab.py data/$src$tgt/train.bpe.$src data/$src$tgt/train.bpe.$tgt --output_path data/$src$tgt/vocab.txt

In [110]:
# Some output
! echo "BPE Swahili congo Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 data/$src$tgt/vocab.txt 

BPE Swahili congo Sentences
Ndiyo sababu ni jambo la maana wazazi wasi@@ a@@ che ku@@ fundisha watoto wao . ”
A@@ cha watoto wako wat@@ amb@@ ue kama una@@ ona Yehova kuwa mutu wa kweli kabisa .
Wazazi hao wan@@ asema hivi : “ Tuli@@ ambia pia mut@@ oto wetu mu@@ kubwa mwanam@@ uke kama am@@ ut@@ egem@@ ee kabisa Yehova , a@@ endelee kufanya mengi katika kazi ya Ufalme , na asi@@ hangai@@ ke sana .
Wakati ana@@ ona matokeo , ana@@ elewa kama Yehova ana@@ endelea kutusaidia .
Hilo li@@ m@@ em@@ usaidia sana a@@ kuwe na imani yenye nguvu kwa Mungu na katika Biblia . ”
Combined BPE Vocab
і@@
д@@
і
<
Ō@@
ù@@
~
£
γ
ז@@


### Creating the JoeyNMT Config

JoeyNMT requires a YAML config. We provide a template below. We've also set a number of defaults with it, which you may play with!

- We used the Transformer architecture
- We set our dropout reasonably high: 0.3 (recommended in (Sennrich, 2019))

Things worth playing with:

- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (set to 50 in the config below; around 30 is enough for a quick check that training works)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)

In [111]:
!ls 

JW300_latest_xml_fr-swc.xml		     fr-swc-baseline_v2
JW300_latest_xml_fr.zip			     jw300.fr
JW300_latest_xml_swc.zip		     jw300.swc
baseline.ipynb				     models
bpe.codes.4000				     test.bpe.fr
buiding_french_global_test_set.ipynb	     test.bpe.swc
buiding_swahili_congo_french_test_set.ipynb  test.en-swc.en
build_vocab.py				     test.en-swc.swc
config					     test.fr
data					     test.fr-any.fr
data_process_without_duplicates.csv	     test.swc
dev.bpe.fr				     train.bpe.fr
dev.bpe.swc				     train.bpe.swc
dev.fr					     train.fr
dev.swc					     train.swc
en-fr-baseline				     vocab.fr
fr-swc-baseline				     vocab.swc


Remove the old model folder.

In [120]:
! rm -fr models

In [121]:
!mkdir models

In [122]:
!mkdir config

mkdir: cannot create directory ‘config’: File exists


In [6]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '{}{}'.format(source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 50                   # TODO: Decrease when playing around and checking that it works. Around 30 is sufficient to check if it's working at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3, 10, 15, 20, 30, 35]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8            # Already at the value suggested for larger data.
        embeddings:
            embedding_dim: 512   # Already at the value suggested for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 512        # Already at the value suggested for larger data.
        ff_size: 2048           # Already at the value suggested for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8            # Already at the value suggested for larger data.
        embeddings:
            embedding_dim: 512    # Already at the value suggested for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 512         # Already at the value suggested for larger data.
        ff_size: 2048            # Already at the value suggested for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("config/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [7]:
!ls config

transformer_frswc.yaml


### Train the Model

The next three cells were attempts to fix a "tensorboard not found" error; they are left commented out.

In [125]:
# !pip3 freeze | grep tensorboard

In [126]:
# !pip3 install tensorboard==1.14.0

In [127]:
#!pip3 install --upgrade torch

This single JoeyNMT command runs the training using the config we created above.

In [None]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!python3 -m joeynmt train config/transformer_$src$tgt.yaml

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2020-02-13 10:27:39,250 Hello! This is Joey-NMT.
2020-02-13 10:27:39,256 Total params: 46426624
2020-02-13 10:27:39,258 Trainable parameters: ['decoder.layer_norm.bias', 'decoder.layer_norm.weight', 'decoder.layers.0.dec_layer_norm.bias', 'decoder.layers.0.dec_layer_norm.weight', 'decoder.layers.0.feed_forward.layer_norm.bias', 'decoder.layers.0.feed_forward.layer_norm.weight', 'decoder.layers.0.feed_forward.pwff_layer.0.bias', 'decoder.layers.0.feed_forward.pwff_layer.0.weight', 'decoder.layers.0.feed_forward.pwff_layer.3.bias', 'decoder.layers.0.feed_forward.pwff_layer.3.weight', 'decoder.layers.0.src_trg_att.k_layer.bias', 'decoder.layers.0.src_trg_att.k_layer.weight', 'decoder.l

2020-02-13 10:27:44,469 cfg.name                           : frswc_transformer
2020-02-13 10:27:44,469 cfg.data.src                       : fr
2020-02-13 10:27:44,469 cfg.data.trg                       : swc
2020-02-13 10:27:44,469 cfg.data.train                     : data/frswc/train.bpe
2020-02-13 10:27:44,469 cfg.data.dev                       : data/frswc/dev.bpe
2020-02-13 10:27:44,469 cfg.data.test                      : data/frswc/test.bpe
2020-02-13 10:27:44,469 cfg.data.level                     : bpe
2020-02-13 10:27:44,469 cfg.data.lowercase                 : False
2020-02-13 10:27:44,469 cfg.data.max_sent_length           : 100
2020-02-13 10:27:44,470 cfg.data.src_vocab                 : data/frswc/vocab.txt
2020-02-13 10:27:44,470 cfg.data.trg_vocab                 : data/frswc/vocab.txt
2020-02-13 10:27:44,470 cfg.testing.beam_size              : 5
2020-02-13 10:27:44,470 cfg.testing.alpha                  : 1.0
2020-02-13 10:27:44,470 cfg.training.random_seed           :

2020-02-13 10:43:56,720 Epoch   1 Step:     1100 Batch Loss:     4.558829 Tokens per Sec:     3043, Lr: 0.000300
2020-02-13 10:45:03,917 Epoch   1 Step:     1200 Batch Loss:     4.192190 Tokens per Sec:     3063, Lr: 0.000300
2020-02-13 10:46:10,380 Epoch   1 Step:     1300 Batch Loss:     4.008200 Tokens per Sec:     3050, Lr: 0.000300
2020-02-13 10:47:15,705 Epoch   1 Step:     1400 Batch Loss:     4.137323 Tokens per Sec:     3088, Lr: 0.000300
2020-02-13 10:48:21,979 Epoch   1 Step:     1500 Batch Loss:     3.771139 Tokens per Sec:     3060, Lr: 0.000300
2020-02-13 10:49:28,269 Epoch   1 Step:     1600 Batch Loss:     3.627386 Tokens per Sec:     3090, Lr: 0.000300
2020-02-13 10:50:35,448 Epoch   1 Step:     1700 Batch Loss:     3.456552 Tokens per Sec:     3048, Lr: 0.000300
2020-02-13 10:51:42,802 Epoch   1 Step:     1800 Batch Loss:     3.673240 Tokens per Sec:     3140, Lr: 0.000300
2020-02-13 10:52:48,498 Epoch   1 Step:     1900 Batch Loss:     3.750333 Tokens per Sec:     30

2020-02-13 11:14:01,989 Epoch   1 Step:     3100 Batch Loss:     3.326180 Tokens per Sec:     3061, Lr: 0.000300
2020-02-13 11:15:08,836 Epoch   1 Step:     3200 Batch Loss:     3.256266 Tokens per Sec:     3093, Lr: 0.000300
2020-02-13 11:16:15,226 Epoch   1 Step:     3300 Batch Loss:     2.836759 Tokens per Sec:     3071, Lr: 0.000300
2020-02-13 11:17:20,907 Epoch   1 Step:     3400 Batch Loss:     3.258413 Tokens per Sec:     3071, Lr: 0.000300
2020-02-13 11:18:27,093 Epoch   1 Step:     3500 Batch Loss:     3.192358 Tokens per Sec:     3034, Lr: 0.000300
2020-02-13 11:19:32,799 Epoch   1 Step:     3600 Batch Loss:     2.874379 Tokens per Sec:     3064, Lr: 0.000300
2020-02-13 11:20:39,387 Epoch   1 Step:     3700 Batch Loss:     3.178644 Tokens per Sec:     3125, Lr: 0.000300
2020-02-13 11:21:46,154 Epoch   1 Step:     3800 Batch Loss:     2.896233 Tokens per Sec:     3029, Lr: 0.000300
2020-02-13 11:22:52,895 Epoch   1 Step:     3900 Batch Loss:     2.805963 Tokens per Sec:     30

2020-02-13 11:44:08,689 Epoch   1 Step:     5100 Batch Loss:     3.052882 Tokens per Sec:     3139, Lr: 0.000300
2020-02-13 11:45:14,484 Epoch   1 Step:     5200 Batch Loss:     2.692811 Tokens per Sec:     3056, Lr: 0.000300
2020-02-13 11:46:21,032 Epoch   1 Step:     5300 Batch Loss:     2.750236 Tokens per Sec:     3117, Lr: 0.000300
2020-02-13 11:47:28,525 Epoch   1 Step:     5400 Batch Loss:     2.762382 Tokens per Sec:     3093, Lr: 0.000300
2020-02-13 11:48:34,377 Epoch   1 Step:     5500 Batch Loss:     2.917958 Tokens per Sec:     3053, Lr: 0.000300
2020-02-13 11:49:40,216 Epoch   1 Step:     5600 Batch Loss:     2.864330 Tokens per Sec:     3069, Lr: 0.000300
2020-02-13 11:50:47,365 Epoch   1 Step:     5700 Batch Loss:     2.948759 Tokens per Sec:     3091, Lr: 0.000300
2020-02-13 11:51:54,050 Epoch   1 Step:     5800 Batch Loss:     2.580101 Tokens per Sec:     3060, Lr: 0.000300
2020-02-13 11:53:01,419 Epoch   1 Step:     5900 Batch Loss:     2.778599 Tokens per Sec:     30

2020-02-13 12:14:14,814 Epoch   2 Step:     7100 Batch Loss:     2.750644 Tokens per Sec:     3053, Lr: 0.000300
2020-02-13 12:15:21,438 Epoch   2 Step:     7200 Batch Loss:     2.655128 Tokens per Sec:     3078, Lr: 0.000300
2020-02-13 12:16:28,129 Epoch   2 Step:     7300 Batch Loss:     2.415816 Tokens per Sec:     3069, Lr: 0.000300
2020-02-13 12:17:35,079 Epoch   2 Step:     7400 Batch Loss:     2.332900 Tokens per Sec:     3085, Lr: 0.000300
2020-02-13 12:18:40,823 Epoch   2 Step:     7500 Batch Loss:     2.408705 Tokens per Sec:     3032, Lr: 0.000300
2020-02-13 12:19:46,980 Epoch   2 Step:     7600 Batch Loss:     2.394734 Tokens per Sec:     3093, Lr: 0.000300
2020-02-13 12:20:53,997 Epoch   2 Step:     7700 Batch Loss:     2.406074 Tokens per Sec:     3105, Lr: 0.000300
2020-02-13 12:22:00,645 Epoch   2 Step:     7800 Batch Loss:     2.686860 Tokens per Sec:     3050, Lr: 0.000300
2020-02-13 12:23:06,346 Epoch   2 Step:     7900 Batch Loss:     2.542987 Tokens per Sec:     30

2020-02-13 12:44:18,942 Epoch   2 Step:     9100 Batch Loss:     2.424658 Tokens per Sec:     3097, Lr: 0.000300
2020-02-13 12:45:25,782 Epoch   2 Step:     9200 Batch Loss:     2.756666 Tokens per Sec:     3104, Lr: 0.000300
2020-02-13 12:46:32,193 Epoch   2 Step:     9300 Batch Loss:     2.483221 Tokens per Sec:     3126, Lr: 0.000300
2020-02-13 12:47:37,240 Epoch   2 Step:     9400 Batch Loss:     2.309789 Tokens per Sec:     3035, Lr: 0.000300
2020-02-13 12:48:44,456 Epoch   2 Step:     9500 Batch Loss:     2.617984 Tokens per Sec:     3147, Lr: 0.000300
2020-02-13 12:49:50,397 Epoch   2 Step:     9600 Batch Loss:     2.087964 Tokens per Sec:     3030, Lr: 0.000300
2020-02-13 12:50:56,997 Epoch   2 Step:     9700 Batch Loss:     2.237980 Tokens per Sec:     3066, Lr: 0.000300
2020-02-13 12:52:02,967 Epoch   2 Step:     9800 Batch Loss:     2.613254 Tokens per Sec:     3032, Lr: 0.000300
2020-02-13 12:53:09,564 Epoch   2 Step:     9900 Batch Loss:     2.342013 Tokens per Sec:     31

2020-02-13 13:14:25,292 Epoch   2 Step:    11100 Batch Loss:     2.285390 Tokens per Sec:     3099, Lr: 0.000300
2020-02-13 13:15:31,624 Epoch   2 Step:    11200 Batch Loss:     2.292525 Tokens per Sec:     3086, Lr: 0.000300
2020-02-13 13:17:44,938 Epoch   2 Step:    11400 Batch Loss:     2.375286 Tokens per Sec:     3074, Lr: 0.000300
2020-02-13 13:18:51,282 Epoch   2 Step:    11500 Batch Loss:     2.447987 Tokens per Sec:     3046, Lr: 0.000300
2020-02-13 13:19:58,140 Epoch   2 Step:    11600 Batch Loss:     2.218121 Tokens per Sec:     3072, Lr: 0.000300
2020-02-13 13:21:04,425 Epoch   2 Step:    11700 Batch Loss:     2.157409 Tokens per Sec:     3063, Lr: 0.000300
2020-02-13 13:22:11,588 Epoch   2 Step:    11800 Batch Loss:     2.219285 Tokens per Sec:     3114, Lr: 0.000300
2020-02-13 13:23:17,401 Epoch   2 Step:    11900 Batch Loss:     2.222605 Tokens per Sec:     3098, Lr: 0.000300
2020-02-13 13:24:24,188 Epoch   2 Step:    12000 Batch Loss:     2.433710 Tokens per Sec:     30

2020-02-13 13:44:29,273 Epoch   2 Step:    13100 Batch Loss:     2.325762 Tokens per Sec:     3134, Lr: 0.000300
2020-02-13 13:45:23,353 Epoch   2: total training loss 15970.08
2020-02-13 13:45:23,353 EPOCH 3
2020-02-13 13:45:36,088 Epoch   3 Step:    13200 Batch Loss:     2.586302 Tokens per Sec:     2889, Lr: 0.000300
2020-02-13 13:46:42,030 Epoch   3 Step:    13300 Batch Loss:     2.261342 Tokens per Sec:     3042, Lr: 0.000300
2020-02-13 13:47:48,610 Epoch   3 Step:    13400 Batch Loss:     2.160784 Tokens per Sec:     3124, Lr: 0.000300
2020-02-13 13:48:55,039 Epoch   3 Step:    13500 Batch Loss:     2.480360 Tokens per Sec:     3099, Lr: 0.000300
2020-02-13 13:51:09,349 Epoch   3 Step:    13700 Batch Loss:     1.981722 Tokens per Sec:     3093, Lr: 0.000300
2020-02-13 13:52:15,976 Epoch   3 Step:    13800 Batch Loss:     2.083113 Tokens per Sec:     3156, Lr: 0.000300
2020-02-13 13:53:21,723 Epoch   3 Step:    13900 Batch Loss:     2.070894 Tokens per Sec:     3054, Lr: 0.000300


2020-02-13 14:14:32,789 Epoch   3 Step:    15100 Batch Loss:     2.222872 Tokens per Sec:     3132, Lr: 0.000300
2020-02-13 14:15:38,904 Epoch   3 Step:    15200 Batch Loss:     2.195426 Tokens per Sec:     3044, Lr: 0.000300
2020-02-13 14:16:44,537 Epoch   3 Step:    15300 Batch Loss:     2.243739 Tokens per Sec:     3053, Lr: 0.000300
2020-02-13 14:17:50,546 Epoch   3 Step:    15400 Batch Loss:     2.060385 Tokens per Sec:     3081, Lr: 0.000300
2020-02-13 14:18:56,952 Epoch   3 Step:    15500 Batch Loss:     2.117823 Tokens per Sec:     3069, Lr: 0.000300
2020-02-13 14:20:02,499 Epoch   3 Step:    15600 Batch Loss:     2.067538 Tokens per Sec:     3034, Lr: 0.000300
2020-02-13 14:21:08,341 Epoch   3 Step:    15700 Batch Loss:     2.123886 Tokens per Sec:     3089, Lr: 0.000300
2020-02-13 14:22:15,031 Epoch   3 Step:    15800 Batch Loss:     2.252592 Tokens per Sec:     3090, Lr: 0.000300
2020-02-13 14:23:21,822 Epoch   3 Step:    15900 Batch Loss:     1.752465 Tokens per Sec:     30

2020-02-13 14:44:32,223 Epoch   3 Step:    17100 Batch Loss:     2.063174 Tokens per Sec:     3097, Lr: 0.000300
2020-02-13 14:45:38,621 Epoch   3 Step:    17200 Batch Loss:     2.196168 Tokens per Sec:     3034, Lr: 0.000300
2020-02-13 14:46:45,951 Epoch   3 Step:    17300 Batch Loss:     1.837460 Tokens per Sec:     3159, Lr: 0.000300
2020-02-13 14:47:51,476 Epoch   3 Step:    17400 Batch Loss:     2.279250 Tokens per Sec:     3033, Lr: 0.000300
2020-02-13 14:48:58,028 Epoch   3 Step:    17500 Batch Loss:     1.921667 Tokens per Sec:     3094, Lr: 0.000300
2020-02-13 14:50:04,624 Epoch   3 Step:    17600 Batch Loss:     2.051232 Tokens per Sec:     3091, Lr: 0.000300
2020-02-13 14:51:10,133 Epoch   3 Step:    17700 Batch Loss:     2.162503 Tokens per Sec:     3067, Lr: 0.000300
2020-02-13 14:52:15,613 Epoch   3 Step:    17800 Batch Loss:     2.190103 Tokens per Sec:     3047, Lr: 0.000300
2020-02-13 14:53:22,612 Epoch   3 Step:    17900 Batch Loss:     1.962997 Tokens per Sec:     31

2020-02-13 15:14:33,696 Epoch   3 Step:    19100 Batch Loss:     1.976501 Tokens per Sec:     3075, Lr: 0.000300
2020-02-13 15:15:40,421 Epoch   3 Step:    19200 Batch Loss:     2.078254 Tokens per Sec:     3104, Lr: 0.000300
2020-02-13 15:16:48,080 Epoch   3 Step:    19300 Batch Loss:     1.978653 Tokens per Sec:     3130, Lr: 0.000300
2020-02-13 15:17:53,928 Epoch   3 Step:    19400 Batch Loss:     2.059787 Tokens per Sec:     3065, Lr: 0.000300
2020-02-13 15:19:00,368 Epoch   3 Step:    19500 Batch Loss:     2.308555 Tokens per Sec:     3051, Lr: 0.000300
2020-02-13 15:20:06,941 Epoch   3 Step:    19600 Batch Loss:     1.970548 Tokens per Sec:     3094, Lr: 0.000300
2020-02-13 15:21:14,402 Epoch   3 Step:    19700 Batch Loss:     2.170464 Tokens per Sec:     3132, Lr: 0.000300
2020-02-13 15:22:07,390 Epoch   3: total training loss 13917.49
2020-02-13 15:22:07,390 EPOCH 4
2020-02-13 15:22:21,982 Epoch   4 Step:    19800 Batch Loss:     2.074150 Tokens per Sec:     3073, Lr: 0.000300


2020-02-13 15:44:37,968 Epoch   4 Step:    21100 Batch Loss:     2.030633 Tokens per Sec:     3093, Lr: 0.000300
2020-02-13 15:45:43,896 Epoch   4 Step:    21200 Batch Loss:     2.071697 Tokens per Sec:     3072, Lr: 0.000300
2020-02-13 15:46:49,520 Epoch   4 Step:    21300 Batch Loss:     1.853978 Tokens per Sec:     3046, Lr: 0.000300
2020-02-13 15:47:56,876 Epoch   4 Step:    21400 Batch Loss:     1.915925 Tokens per Sec:     3132, Lr: 0.000300
2020-02-13 15:49:02,406 Epoch   4 Step:    21500 Batch Loss:     1.919946 Tokens per Sec:     3094, Lr: 0.000300
2020-02-13 15:50:09,138 Epoch   4 Step:    21600 Batch Loss:     1.969241 Tokens per Sec:     3095, Lr: 0.000300
2020-02-13 15:51:15,559 Epoch   4 Step:    21700 Batch Loss:     2.111245 Tokens per Sec:     3088, Lr: 0.000300
2020-02-13 15:52:22,926 Epoch   4 Step:    21800 Batch Loss:     1.973348 Tokens per Sec:     3073, Lr: 0.000300
2020-02-13 15:53:29,445 Epoch   4 Step:    21900 Batch Loss:     2.006958 Tokens per Sec:     31
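
The `Total params: 46426624` line near the top of the training log can be cross-checked against the config. The sketch below is one decomposition that reproduces the number. It assumes each attention block has four biased linear projections, each encoder layer has two layer norms and each decoder layer three, both stacks end in a final layer norm, and the tied embeddings and tied softmax contribute a single `vocab_size x hidden_size` matrix. These are assumptions about JoeyNMT's layer layout rather than facts stated in this notebook, and the helper names are made up for illustration.

```python
def linear(n_in, n_out, bias=True):
    """Parameters in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(size):
    """One gain and one bias per feature."""
    return 2 * size


def attention(size):
    """Query, key, value and output projections, each with a bias."""
    return 4 * linear(size, size)


def feed_forward(size, ff_size):
    """Two linear maps plus the block's own layer norm."""
    return linear(size, ff_size) + linear(ff_size, size) + layer_norm(size)


def transformer_params(vocab_size, hidden=512, ff_size=2048, layers=6):
    """Total parameters under the assumptions listed in the text."""
    encoder_layer = attention(hidden) + layer_norm(hidden) + feed_forward(hidden, ff_size)
    decoder_layer = 2 * attention(hidden) + 2 * layer_norm(hidden) + feed_forward(hidden, ff_size)
    encoder = layers * encoder_layer + layer_norm(hidden)
    decoder = layers * decoder_layer + layer_norm(hidden)
    shared_embedding = vocab_size * hidden  # tied source/target/output matrix
    return encoder + decoder + shared_embedding


logged_total = 46426624  # from the "Total params" log line
remainder = logged_total - transformer_params(vocab_size=0)
implied_vocab, leftover = divmod(remainder, 512)
print(implied_vocab, leftover)
```

Under these assumptions the remainder divides evenly by the embedding size, which implies a joint vocabulary of 4465 entries. That figure is inferred from the arithmetic, not read from `vocab.txt`, so it is worth confirming with `wc -l` on the vocabulary file.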

In [22]:
! tail -10 models/frswc_transformer/train.log

2020-02-16 20:58:59,082 Epoch  50 Step:   329500 Batch Loss:     1.093634 Tokens per Sec:     2922, Lr: 0.000008
2020-02-16 21:00:09,071 Epoch  50 Step:   329600 Batch Loss:     1.067766 Tokens per Sec:     3013, Lr: 0.000008
2020-02-16 21:01:18,867 Epoch  50 Step:   329700 Batch Loss:     1.089992 Tokens per Sec:     2963, Lr: 0.000008
2020-02-16 21:01:20,211 Epoch  50: total training loss 7458.01
2020-02-16 21:01:20,212 Training ended after  50 epochs.
2020-02-16 21:01:20,212 Best validation result at step   329000:   3.61 ppl.
2020-02-16 21:03:49,905  dev bleu:  26.73 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-16 21:03:49,907 Translations saved to: models/frswc_transformer/00329000.hyps.dev
2020-02-16 21:07:32,639 test bleu:  33.73 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-16 21:07:32,643 Translations saved to: models/frswc_transformer/00329000.hyps.test
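
The final learning rate in this log (`Lr: 0.000008`) can be related to the `plateau` settings in the config (`learning_rate: 0.0003`, `decrease_factor: 0.7`). The sketch below assumes that every reduction multiplies the rate by `decrease_factor`, that nothing else changed the rate during the run, and that the log prints the rate rounded to six decimals, as the lines above suggest. It then searches for the number of reductions that would print as `0.000008`.

```python
initial_lr = 0.0003      # learning_rate in the config
decrease_factor = 0.7    # decrease_factor in the config
logged_lr = "0.000008"   # value printed in the last training steps


def lr_after(n_reductions):
    """Learning rate after n plateau reductions, under the stated assumptions."""
    return initial_lr * decrease_factor ** n_reductions


# Collect every reduction count whose six-decimal rendering matches the log.
matches = [n for n in range(30) if f"{lr_after(n):.6f}" == logged_lr]
print(matches)
```

Only one count matches under these assumptions, so the scheduler would have reduced the rate ten times over the 50 epochs, each time after `patience` validation rounds without improvement.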


In [130]:
!ls models/${src}${tgt}_transformer

00329000.hyps.dev   16000.hyps	 221000.hyps  284000.hyps  47000.hyps
00329000.hyps.test  160000.hyps  222000.hyps  285000.hyps  48000.hyps
1000.hyps	    161000.hyps  223000.hyps  286000.hyps  49000.hyps
10000.hyps	    162000.hyps  224000.hyps  287000.hyps  5000.hyps
100000.hyps	    163000.hyps  225000.hyps  288000.hyps  50000.hyps
101000.hyps	    164000.hyps  226000.hyps  289000.hyps  51000.hyps
102000.hyps	    165000.hyps  227000.hyps  29000.hyps   52000.hyps
103000.hyps	    166000.hyps  228000.hyps  290000.hyps  53000.hyps
104000.hyps	    167000.hyps  229000.hyps  291000.hyps  54000.hyps
105000.hyps	    168000.hyps  23000.hyps   292000.hyps  55000.hyps
106000.hyps	    169000.hyps  230000.hyps  293000.hyps  56000.hyps
107000.hyps	    17000.hyps	 231000.hyps  294000.hyps  57000.hyps
108000.hyps	    170000.hyps  232000.hyps  295000.hyps  58000.hyps
109000.hyps	    171000.hyps  233000.hyps  296000.hyps  59000.hyps
11000.hyps	    172000.hyps  234000.hyps  297000.hyps  6000.h

In [131]:
! tail models/${src}${tgt}_transformer/train.log

2020-02-16 20:58:59,082 Epoch  50 Step:   329500 Batch Loss:     1.093634 Tokens per Sec:     2922, Lr: 0.000008
2020-02-16 21:00:09,071 Epoch  50 Step:   329600 Batch Loss:     1.067766 Tokens per Sec:     3013, Lr: 0.000008
2020-02-16 21:01:18,867 Epoch  50 Step:   329700 Batch Loss:     1.089992 Tokens per Sec:     2963, Lr: 0.000008
2020-02-16 21:01:20,211 Epoch  50: total training loss 7458.01
2020-02-16 21:01:20,212 Training ended after  50 epochs.
2020-02-16 21:01:20,212 Best validation result at step   329000:   3.61 ppl.
2020-02-16 21:03:49,905  dev bleu:  26.73 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-16 21:03:49,907 Translations saved to: models/frswc_transformer/00329000.hyps.dev
2020-02-16 21:07:32,639 test bleu:  33.73 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-16 21:07:32,643 Translations saved to: models/frswc_transformer/00329000.hyps.test


In [132]:
! cat models/${src}${tgt}_transformer/validations.txt

Steps: 1000	Loss: 117640.85156	PPL: 72.96771	bleu: 0.30677	LR: 0.00030000	*
Steps: 2000	Loss: 96072.33594	PPL: 33.23078	bleu: 1.44532	LR: 0.00030000	*
Steps: 3000	Loss: 86643.24219	PPL: 23.56177	bleu: 2.57306	LR: 0.00030000	*
Steps: 4000	Loss: 80421.35938	PPL: 18.77885	bleu: 3.81392	LR: 0.00030000	*
Steps: 5000	Loss: 75786.34375	PPL: 15.85851	bleu: 5.58060	LR: 0.00030000	*
Steps: 6000	Loss: 72432.73438	PPL: 14.03297	bleu: 6.41285	LR: 0.00030000	*
Steps: 7000	Loss: 70071.18750	PPL: 12.87504	bleu: 7.37862	LR: 0.00030000	*
Steps: 8000	Loss: 66923.03906	PPL: 11.47863	bleu: 8.53179	LR: 0.00030000	*
Steps: 9000	Loss: 64534.80078	PPL: 10.52123	bleu: 9.22430	LR: 0.00030000	*
Steps: 10000	Loss: 62539.73438	PPL: 9.78295	bleu: 10.64979	LR: 0.00030000	*
Steps: 11000	Loss: 60636.72656	PPL: 9.12706	bleu: 11.01825	LR: 0.00030000	*
Steps: 12000	Loss: 59227.60156	PPL: 8.66990	bleu: 11.86854	LR: 0.00030000	*
Steps: 13000	Loss: 58186.77344	PPL: 8.34699	bleu: 12.08454	LR: 0.00030000	*
Steps: 
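
Reading `validations.txt` by eye gets tedious for a run with hundreds of validation rounds. Below is a minimal parsing sketch, assuming every complete line follows the tab-separated `Steps / Loss / PPL / bleu / LR` layout shown above. The function name and the sample string are illustrative only. The last step also assumes that the reported perplexity is `exp(loss / number_of_target_tokens)`, which lets the dev-set token count be recovered from each row.

```python
import math
import re

# Matches one complete line of validations.txt as shown in the output above.
LINE_RE = re.compile(
    r"Steps:\s*(\d+)\s+Loss:\s*([\d.]+)\s+PPL:\s*([\d.]+)"
    r"\s+bleu:\s*([\d.]+)\s+LR:\s*([\d.]+)"
)


def parse_validations(text):
    """Return one dict per complete validation line; skip anything else."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # blank or truncated line
        step, loss, ppl, bleu, lr = match.groups()
        rows.append({
            "step": int(step),
            "loss": float(loss),
            "ppl": float(ppl),
            "bleu": float(bleu),
            "lr": float(lr),
        })
    return rows


# First three lines of the output above, plus a truncated line.
sample = (
    "Steps: 1000\tLoss: 117640.85156\tPPL: 72.96771\tbleu: 0.30677\tLR: 0.00030000\t*\n"
    "Steps: 2000\tLoss: 96072.33594\tPPL: 33.23078\tbleu: 1.44532\tLR: 0.00030000\t*\n"
    "Steps: 3000\tLoss: 86643.24219\tPPL: 23.56177\tbleu: 2.57306\tLR: 0.00030000\t*\n"
    "Steps: \n"
)

rows = parse_validations(sample)
best = min(rows, key=lambda row: row["ppl"])  # early_stopping_metric is "ppl"
implied_tokens = [row["loss"] / math.log(row["ppl"]) for row in rows]
print(best["step"], [round(t) for t in implied_tokens])
```

To use it on the real file, pass `open("models/frswc_transformer/validations.txt").read()` instead of `sample`. If the implied token count is roughly constant across rows, that supports the assumed relation between loss and perplexity.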

In [133]:
!python3 -m joeynmt test config/transformer_$src$tgt.yaml

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2020-02-17 06:41:57,032 -  dev bleu:  26.73 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-17 06:45:36,029 - test bleu:  33.73 [Beam search decoding with beam size = 5 and alpha = 1.0]


Save the results to the folder

In [135]:
!mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
# Copy the created models from the notebook storage to $gdrive_path (a folder on the VM) for persistent storage


cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/329000.ckpt': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/329000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/33000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/34000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/35000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/36000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/37000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/38000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/39000.hyps': No space left on device
cp: error writing 'fr-swc-baseline_v2/models/frswc_transformer/4000.hyps': No space left 
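
The `No space left on device` errors come from copying the whole model directory, including every intermediate `.hyps` file and checkpoint. A hedged alternative is sketched below: copy only the files that the later cells copy one by one, and check free space first. The function name and the `ESSENTIAL` tuple are illustrative and not part of JoeyNMT.

```python
import shutil
from pathlib import Path

# File names taken from the individual cp cells further down in this notebook.
ESSENTIAL = (
    "best.ckpt",
    "config.yaml",
    "src_vocab.txt",
    "trg_vocab.txt",
    "validations.txt",
)


def copy_essentials(model_dir, dest_dir, names=ESSENTIAL):
    """Copy only the named files, refusing to start if they cannot fit."""
    model_dir, dest_dir = Path(model_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    present = [name for name in names if (model_dir / name).exists()]
    needed = sum((model_dir / name).stat().st_size for name in present)
    free = shutil.disk_usage(dest_dir).free
    if needed > free:
        raise OSError(f"need {needed} bytes but only {free} are free in {dest_dir}")

    for name in present:
        # copy2 follows symlinks by default, so a linked checkpoint is copied by content.
        shutil.copy2(model_dir / name, dest_dir / name)
    return present


# Example call with this notebook's paths (left commented out so nothing is copied on import):
# copy_essentials("models/frswc_transformer",
#                 "fr-swc-baseline_v2/models/frswc_transformer")
```

If `best.ckpt` is a symbolic link to a numbered checkpoint, as it may be depending on the JoeyNMT version, this copies the checkpoint contents rather than the link, so the copy stays usable even when the numbered checkpoint is not copied.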

In [136]:
!ls models/${src}${tgt}_transformer

00329000.hyps.dev   159000.hyps  22000.hyps   281000.hyps  46000.hyps
00329000.hyps.test  16000.hyps	 220000.hyps  282000.hyps  47000.hyps
1000.hyps	    160000.hyps  221000.hyps  283000.hyps  48000.hyps
10000.hyps	    161000.hyps  222000.hyps  284000.hyps  49000.hyps
100000.hyps	    162000.hyps  223000.hyps  285000.hyps  5000.hyps
101000.hyps	    163000.hyps  224000.hyps  286000.hyps  50000.hyps
102000.hyps	    164000.hyps  225000.hyps  287000.hyps  51000.hyps
103000.hyps	    165000.hyps  226000.hyps  288000.hyps  52000.hyps
104000.hyps	    166000.hyps  227000.hyps  289000.hyps  53000.hyps
105000.hyps	    167000.hyps  228000.hyps  29000.hyps   54000.hyps
106000.hyps	    168000.hyps  229000.hyps  290000.hyps  55000.hyps
107000.hyps	    169000.hyps  23000.hyps   291000.hyps  56000.hyps
108000.hyps	    17000.hyps	 230000.hyps  292000.hyps  57000.hyps
109000.hyps	    170000.hyps  231000.hyps  293000.hyps  58000.hyps
11000.hyps	    171000.hyps  232000.hyps  294000.hyps  59000.

In [140]:
!cp -r models/${src}${tgt}_transformer/best.ckpt "$gdrive_path/models/${src}${tgt}_transformer/"

In [141]:
!cp -r models/${src}${tgt}_transformer/config.yaml "$gdrive_path/models/${src}${tgt}_transformer/"

In [142]:
!cp -r models/${src}${tgt}_transformer/src_vocab.txt "$gdrive_path/models/${src}${tgt}_transformer/"

In [143]:
!cp -r models/${src}${tgt}_transformer/trg_vocab.txt "$gdrive_path/models/${src}${tgt}_transformer/"

In [144]:
!cp -r models/${src}${tgt}_transformer/validations.txt "$gdrive_path/models/${src}${tgt}_transformer/"

In [145]:
!ls $data_path

bpe.codes.4000	test.bpe.fr	  test.en-any.en.2  test.fr-swc.fr   train.fr
dev.bpe.fr	test.bpe.swc	  test.en-swc.en    test.fr-swc.swc  train.swc
dev.bpe.swc	test.en		  test.en-swc.swc   test.swc	     vocab.txt
dev.fr		test.en-any.en	  test.fr	    train.bpe.fr
dev.swc		test.en-any.en.1  test.fr-any.fr    train.bpe.swc


In [152]:
!cp $data_path/test.fr "$gdrive_path/models/${src}${tgt}_transformer/"

In [153]:
!cp $data_path/test.swc "$gdrive_path/models/${src}${tgt}_transformer/"

In [154]:
!cp $data_path/vocab.txt "$gdrive_path/models/${src}${tgt}_transformer/"

In [12]:
!cp $data_path/train.bpe.swc "$gdrive_path/models/${src}${tgt}_transformer/"

In [13]:
!cp $data_path/train.bpe.fr "$gdrive_path/models/${src}${tgt}_transformer/"

In [17]:
!ls fr-swc-baseline_v2/models/${src}${tgt}_transformer/

00329000.hyps.dev   16000.hyps	 221000.hyps  284000.hyps  49000.hyps
00329000.hyps.test  160000.hyps  222000.hyps  285000.hyps  5000.hyps
1000.hyps	    161000.hyps  223000.hyps  286000.hyps  50000.hyps
10000.hyps	    162000.hyps  224000.hyps  287000.hyps  51000.hyps
100000.hyps	    163000.hyps  225000.hyps  288000.hyps  52000.hyps
101000.hyps	    164000.hyps  226000.hyps  289000.hyps  53000.hyps
102000.hyps	    165000.hyps  227000.hyps  29000.hyps   54000.hyps
103000.hyps	    166000.hyps  228000.hyps  290000.hyps  55000.hyps
104000.hyps	    167000.hyps  229000.hyps  291000.hyps  56000.hyps
105000.hyps	    168000.hyps  23000.hyps   292000.hyps  57000.hyps
106000.hyps	    169000.hyps  230000.hyps  293000.hyps  58000.hyps
107000.hyps	    17000.hyps	 231000.hyps  294000.hyps  59000.hyps
108000.hyps	    170000.hyps  232000.hyps  295000.hyps  6000.hyps
109000.hyps	    171000.hyps  233000.hyps  296000.hyps  60000.hyps
11000.hyps	    172000.hyps  234000.hyps  297000.hyps  61000.h

In [26]:
! tail -20 ../french_lingala_espoir/models/frln_transformer/train.log

2020-02-18 13:41:36,642 Example #5
2020-02-18 13:41:36,642 	Raw source:     ['Peut', '-', 'être', 'comp@@', 'ren@@', 'ez', '-', '\u200b', 'vous', 'ses', 'pro@@', 'ches', '.']
2020-02-18 13:41:36,642 	Raw hypothesis: ['Mbala', 'mosusu', 'oyebi', 'bandeko', 'na', 'ye', '.']
2020-02-18 13:41:36,642 	Source:     Peut - être comprenez - ​ vous ses proches .
2020-02-18 13:41:36,642 	Reference:  Ntango mosusu okoki kozala na likanisi ndenge moko na bato ya libota na ye .
2020-02-18 13:41:36,642 	Hypothesis: Mbala mosusu oyebi bandeko na ye .
2020-02-18 13:41:36,642 Example #10
2020-02-18 13:41:36,642 	Raw source:     ['L’@@', 'endur@@', 'ance', 'face', 'à', 'de', 't@@', 'elles', 'épreuves', 'est', 'particulièrement', 'préci@@', 'euse', 'pour', 'Jéhovah', '.']
2020-02-18 13:41:36,642 	Raw hypothesis: ['Koy@@', 'ika', 'mpiko', 'na', 'komekama', 'ya', 'ndenge', 'wana', 'ezali', 'na', 'ntina', 'mingi', 'mpo', 'na', 'Yehova', '.']
2020-02-18 13:41:36,642 	Source:     L’endurance face à de

In [14]:
!cp $data_path/test.bpe.swc "$gdrive_path/models/${src}${tgt}_transformer/"
!cp $data_path/test.bpe.fr "$gdrive_path/models/${src}${tgt}_transformer/"

In [23]:
['espoir'][1]

IndexError: list index out of range