A BPE-based (byte-pair encoding) tokenizer used for the MEHDIE project and for training a bilingual BERT model. The vocabulary size is 52,000.
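
For readers unfamiliar with BPE, the following is a minimal, self-contained sketch of how merge rules are learned and applied. It is a toy illustration only and not the code used to build this tokenizer: the function names (`merge_pair`, `learn_bpe`, `encode`), the tie-breaking rule, and the tiny word list are all assumptions made for the example. A real tokenizer such as this one would learn on the order of tens of thousands of merges from a large corpus to reach a vocabulary of 52,000 entries.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start from characters: each distinct word is a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, used only to exercise the functions above.
toy_words = ["low", "low", "lower", "lowest"]
toy_merges = learn_bpe(toy_words, num_merges=3)
toy_tokens = encode("lowest", toy_merges)
```

Joining the tokens returned by `encode` always reproduces the input word, since merges only concatenate adjacent symbols. Production tokenizers additionally handle pre-tokenization, special tokens, and unknown characters, none of which this sketch attempts.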