---
language:
- vi
- en
---

# Description

This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer that supports both English and Vietnamese. In addition to tokenization, it also performs diacritics normalization for Vietnamese, for example: `hóa → hoá`, `hủy → huỷ`.

# Details

## Library used for training

https://github.com/google/sentencepiece

## Training Data

https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer

## Training script

```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```

`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo. The training script should be run on a machine with 64 GB of RAM. After training, two files are produced: `vien.model` and `vien.vocab`.

## Converting the SPM model to a HuggingFace tokenizer

Run the following Python script to convert the SPM model into a HuggingFace tokenizer.

```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="",
    eos_token="",
    unk_token="",
    sep_token="",
    pad_token="",
    cls_token="",
    mask_token=""
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```

Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform normalization. Notice that in the usage example above, `wóa` was normalized to `woá`. A short sketch for checking this normalization is included at the end of this README.

# Contact information

For personal communication related to this project, please contact Loi Le Vu (levuloihust@gmail.com).
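
# Appendix: Verifying diacritics normalization

The snippet below is a minimal sketch for double-checking the Vietnamese diacritics normalization described in the Description section. It assumes the tokenizer is loaded from the `levuloihust/vien-unigram-tokenizer` repo as in the usage example, and it only relies on standard `transformers` calls; the exact sub-token split may differ from what the comments suggest.

```python
from transformers import AutoTokenizer

# Load the slow (SentencePiece-backed) tokenizer; use_fast=False is required
# so that the normalization rules from nmt_nfkc_vidiacritic.tsv are applied.
tokenizer = AutoTokenizer.from_pretrained(
    "levuloihust/vien-unigram-tokenizer", use_fast=False
)

# Old-style tone placement should be rewritten to the new style
# before tokenization: hóa -> hoá, hủy -> huỷ.
text = "hóa hủy"
tokens = tokenizer.tokenize(text)
print(tokens)  # surface forms are expected to contain "hoá" and "huỷ"

# Round-trip through token ids: the decoded string should also show the
# normalized spellings rather than the original input.
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))
```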