---
license: other
language:
- ar
---
Arabic BPE Tokenization Using Google SentencePiece

Natural language processing (NLP) is a branch of AI. One of the first steps in any NLP system is encoding the text into tokens for a language model. The challenge is to represent words efficiently. Sub-word encoding suits Arabic well, because Arabic words are commonly built from a stem plus attached affixes. For example, the word مدرساتهم ("their female teachers") is not treated as a single token but is split into three: مدرس, ات, and هم. That is the basic intuition. The split is learned automatically from data, without hand-written rules or preprocessing.

Vocab size: 32000

Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding

License: Waqf v2
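
To make the sub-word intuition concrete, here is a minimal pure-Python sketch of byte-pair encoding (BPE) merges. It is not the SentencePiece implementation and not this project's training code: the function names `merge_pair`, `learn_bpe`, and `encode`, as well as the tiny `TOY_CORPUS`, are hypothetical and chosen only for illustration. A real model is trained on a large corpus up to the full 32000-entry vocabulary, so its splits will differ from what this toy produces.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only (not the project's training data).
TOY_CORPUS = ["مدرس", "مدرسات", "مدرساتهم", "كتابهم", "معلمات"]


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = Counter({tuple(merge_pair(s, best)): f for s, f in vocab.items()})
    return merges


def encode(word, merges):
    """Split `word` into sub-word tokens by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    merges = learn_bpe(TOY_CORPUS, num_merges=10)
    print(encode("مدرساتهم", merges))
```

Frequent character sequences such as a shared stem or a common suffix are merged early, so they tend to surface as tokens of their own. No linguistic rules are involved; the splits come only from pair frequencies in the training text.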