---
language:
- vi
- en
---

# Description

This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer that supports both English and Vietnamese.

In addition to tokenization, this tokenizer also performs diacritics normalization for Vietnamese, for example `hóa → hoá` and `hủy → huỷ`.
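
For a quick check of this normalization (a minimal sketch, assuming the tokenizer is loaded from the Hugging Face Hub as in the Usage section below; the exact pieces depend on the learned vocabulary):
```python
from transformers import AutoTokenizer

# The slow tokenizer is required so that diacritics normalization is applied.
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
print(tokenizer.tokenize("hóa hủy"))
# Expected to contain the normalized forms "hoá" and "huỷ".
```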

# Details

## Library used to train
https://github.com/google/sentencepiece

## Training Data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer

## Training script
```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="<unk>" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.
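
If you prefer not to build the binary, the same training run can likely be launched through the `sentencepiece` Python bindings, which accept the same flag string as `spm_train` (a sketch under that assumption):
```python
import sentencepiece as spm

# Same flags as the spm_train command above, passed as a single argument string.
spm.SentencePieceTrainer.train(
    "--input=vien-corpus.txt "
    "--model_prefix=vien "
    "--vocab_size=64000 "
    "--user_defined_symbols_file=user_defined_symbols.txt "
    "--required_chars_file=required_chars.txt "
    "--unk_surface=<unk> "
    "--byte_fallback=false "
    "--split_by_unicode_script=true "
    "--split_by_number=true "
    "--split_digits=true "
    "--normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv"
)
```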

The training script should be run on a machine with 64 GB of RAM. Training produces two files: `vien.model` and `vien.vocab`.
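
To sanity-check the trained model, it can be loaded directly with the `sentencepiece` Python bindings (a minimal sketch; the sample sentence is illustrative):
```python
import sentencepiece as spm

# Load the trained SentencePiece model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.encode("Thời tiết hôm nay đẹp", out_type=str))
```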

## Convert SPM model to HuggingFace tokenizer

Run the following Python script to convert the SPM model to a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>"
)
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
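
To verify the conversion, the saved tokenizer can be loaded back from the same directory (a minimal sketch, assuming the `assets/hf-tokenizer` path from the previous step):
```python
from transformers import DebertaV2Tokenizer

# Reload the converted tokenizer from disk and run a quick check.
tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")
print(tokenizer.tokenize("Thời tiết hôm nay đẹp"))
```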

## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```

Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform diacritics normalization (in the usage example above, `wóa` is normalized to `woá`).
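
Beyond `tokenize`, the slow tokenizer supports the usual encode/decode round trip; a minimal sketch (the decoded text is expected to contain the normalized diacritics):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

# Encode to input IDs (with special tokens added) and decode back to text.
ids = tokenizer.encode("Thời tiết hôm nay đẹp wóa")
print(tokenizer.decode(ids, skip_special_tokens=True))
# Expected: the normalized surface form, i.e. "wóa" comes back as "woá".
```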

# Contact information
For personal communication related to this project, please contact Loi Le Vu ([email protected]).