---
language:
- vi
- en
---
# Description
This is a [Unigram](https://arxiv.org/pdf/1804.10959.pdf) tokenizer that supports both English and Vietnamese.
Along with tokenization, it also performs diacritic normalization for Vietnamese, for example `hóa → hoá` and `hủy → huỷ`.
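As a minimal illustration of the normalization (a sketch using the published tokenizer described in the Usage section below; the exact token split is not the point here, only the re-joined, normalized string):
```python
from transformers import AutoTokenizer

# Tokenize an "old-style" diacritic spelling and join the pieces back:
# the result should use the "new-style" spelling.
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("hóa hủy")
print(tokenizer.convert_tokens_to_string(tokens))  # expected: "hoá huỷ"
```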
# Details
## Library used to train
https://github.com/google/sentencepiece
## Training Data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer
## Training script
```bash
./spm_train \
    --input=vien-corpus.txt \
    --model_prefix=vien \
    --vocab_size=64000 \
    --user_defined_symbols_file=user_defined_symbols.txt \
    --required_chars_file=required_chars.txt \
    --unk_surface="<unk>" \
    --byte_fallback=false \
    --split_by_unicode_script=true \
    --split_by_number=true \
    --split_digits=true \
    --normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
```
`spm_train` is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (`user_defined_symbols.txt`, `required_chars.txt`, and `nmt_nfkc_vidiacritic.tsv`) are provided in this repo.
The training script should be run on a machine with at least 64GB of RAM. After training, you get two files: `vien.model` and `vien.vocab`.
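Before converting the model, you can sanity-check it with the `sentencepiece` Python package (a quick check sketch; the example sentence is arbitrary and the expected vocabulary size follows from the training command above):
```python
import sentencepiece as spm

# Load the freshly trained model and run a quick encode check.
sp = spm.SentencePieceProcessor(model_file="vien.model")
print(sp.vocab_size())                                   # should match --vocab_size=64000
print(sp.encode("Thời tiết hôm nay đẹp", out_type=str))  # subword pieces as strings
```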
## Convert SPM model to HuggingFace tokenizer
Run the following Python script to convert the SPM model to a HuggingFace tokenizer.
```python
from transformers import DebertaV2Tokenizer

# Wrap the SentencePiece model in a (slow) DebertaV2Tokenizer.
tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>",
)

# Save it in HuggingFace format.
tokenizer.save_pretrained("assets/hf-tokenizer")
```
Replace `assets/spm/vien.model` and `assets/hf-tokenizer` with the correct paths on your local machine.
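To verify the conversion, the saved directory can be loaded back with `AutoTokenizer` (a quick check, assuming the same example path as above):
```python
from transformers import AutoTokenizer

# Reload the converted tokenizer from the local directory and tokenize a sample.
tokenizer = AutoTokenizer.from_pretrained("assets/hf-tokenizer", use_fast=False)
print(tokenizer.tokenize("Thời tiết hôm nay đẹp"))
```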
## Usage
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
```
Note that you must set `use_fast=False` for the tokenizer to function properly. With `use_fast=True` (the default), the tokenizer cannot perform the diacritic normalization (notice how, in the usage example above, `wóa` was normalized to `woá`).
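The difference can be seen by loading both variants side by side (a sketch based on the behaviour described above; the outputs are not reproduced here):
```python
from transformers import AutoTokenizer

repo = "levuloihust/vien-unigram-tokenizer"
slow = AutoTokenizer.from_pretrained(repo, use_fast=False)
fast = AutoTokenizer.from_pretrained(repo, use_fast=True)

text = "đẹp wóa trời lun"
print(slow.tokenize(text))  # 'wóa' is normalized to 'woá' before splitting
print(fast.tokenize(text))  # no diacritic normalization is applied
```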
# Contact information
For personal communication related to this project, please contact Loi Le Vu ([email protected]).