Tokenizer for Réunion Creole 🇷🇪

This tokenizer is designed specifically for Réunion Creole, a French-based creole spoken primarily on Réunion Island. It is built on the Byte Pair Encoding (BPE) model and optimized for the lexical and orthographic particularities of the language.

Features

  • Built using the BPE (Byte Pair Encoding) model.
  • Trained on "LA RIME, Mo i akorde dann bal zakor", a free-access book.
  • Supports special tokens for common NLP tasks:
    • [CLS]: Start-of-sequence token for classification tasks.
    • [SEP]: Separator token for multi-segment inputs.
    • [PAD]: Padding token.
    • [MASK]: Masking token used for training masked language models.
    • [UNK]: Token for unknown words.
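To illustrate how these special tokens frame a model input, here is a minimal, hypothetical sketch (plain Python, not the tokenizer's actual internals): `[CLS]` opens the sequence, `[SEP]` closes each segment, and `[PAD]` fills the remainder up to a fixed length. The function name and `max_len` value are illustrative only.

```python
# Hypothetical illustration of special-token framing (not the real tokenizer).
def build_input(segment_a, segment_b=None, max_len=12):
    # [CLS] marks the start of the sequence for classification tasks.
    tokens = ["[CLS]"] + segment_a + ["[SEP]"]
    # A second segment, if present, is closed by its own [SEP].
    if segment_b:
        tokens += segment_b + ["[SEP]"]
    # [PAD] fills the sequence up to the fixed length.
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

print(build_input(["Comment", "i", "lé"], ["zot", "tout"]))
```

In practice the `transformers` tokenizer inserts these tokens for you when encoding.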

Usage

Loading the Tokenizer

You can easily load this tokenizer using the transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example of tokenization
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)
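For readers curious how a BPE tokenizer of this kind is produced, below is a minimal sketch using the Hugging Face `tokenizers` library. The corpus line and vocabulary size are placeholders, not the actual training setup used for this tokenizer (which was trained on the book cited above).

```python
# Sketch of training a BPE tokenizer with the Hugging Face `tokenizers`
# library; corpus and vocab_size are placeholders, not the real setup.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "Comment i lé zot tout ?",  # sample sentence; the real corpus is the book
]

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=1000,  # placeholder value
)
tok.train_from_iterator(corpus, trainer=trainer)

print(tok.encode("Comment i lé").tokens)
```

The trained tokenizer can then be wrapped for use with `transformers` and pushed to the Hub.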

Author: Hugo How-Choong
