---
language:
- vi
- en
---

# Description

This tokenizer is of type [Unigram](https://arxiv.org/pdf/1804.10959.pdf), supporting both English and Vietnamese.

In addition to tokenization, it also performs diacritics normalization for Vietnamese. For example: `hóa → hoá`, `hủy → huỷ`.
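
As a quick illustration, the snippet below tokenizes the two words above with the published tokenizer (a minimal sketch, loading the tokenizer as described in the Usage section; the exact token splits depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

# Load the slow (SentencePiece-backed) tokenizer so the normalization rules are applied.
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

for word in ["hóa", "hủy"]:
    # The diacritic placement should be rewritten (e.g. "hóa" -> "hoá") before splitting.
    print(word, "->", tokenizer.tokenize(word))
```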

# Details

## Library used to train

https://github.com/google/sentencepiece

## Training Data

https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer

## Training script

```bash
./spm_train \
  --input=vien-corpus.txt \
  --model_prefix=vien \
  --vocab_size=64000 \
  --user_defined_symbols_file=user_defined_symbols.txt \
  --required_chars_file=required_chars.txt \
  --unk_surface="<unk>" \
  --byte_fallback=false \
  --split_by_unicode_script=true \
  --split_by_number=true \
  --split_digits=true \
  --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```

`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.

The training script should be run on a machine with at least 64GB of RAM. After training, you get two files: `vien.model` and `vien.vocab`.
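
To sanity-check the trained model before conversion, you can load it with the `sentencepiece` Python bindings (a sketch, assuming the package is installed and `vien.model` is in the current directory):

```python
import sentencepiece as spm

# Load the trained Unigram model and segment a mixed English/Vietnamese sentence.
sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.encode("How are you? Thời tiết hôm nay đẹp.", out_type=str))
print("vocab size:", sp.get_piece_size())  # expected to be 64000, per --vocab_size above
```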

## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model to a HuggingFace tokenizer.

```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```

Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
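
To confirm the conversion produced a loadable tokenizer, you can reload it from the output directory (a sketch, reusing the example path from above):

```python
from transformers import AutoTokenizer

# Reload the converted tokenizer from the directory written by save_pretrained.
tokenizer = AutoTokenizer.from_pretrained("assets/hf-tokenizer", use_fast=False)
print(tokenizer.tokenize("Thời tiết hôm nay đẹp."))
```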

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform normalization. (Notice that in the usage example above, `wóa` was normalized to `woá`.)
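
The difference can be checked by tokenizing the same word with both variants (a sketch; exact splits depend on the vocabulary, but only the slow tokenizer is expected to apply the diacritics normalization):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
fast = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=True)

# The slow tokenizer rewrites "wóa" to "woá" before splitting; the fast one does not.
print("slow:", slow.tokenize("wóa"))
print("fast:", fast.tokenize("wóa"))
```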

# Contact information

For personal communication related to this project, please contact Loi Le Vu ([email protected]).