---
library_name: transformers
language:
- rcf
---

# Tokenizer for Réunion Creole 🇷🇪

This tokenizer, created by Hugo How-Choong, is specifically designed for working with **Réunion Creole**, a language primarily spoken on the island of Réunion. It is based on the **Byte Pair Encoding (BPE)** model and optimized for the lexical and orthographic specificities of the language.

## Features

- Built using the **BPE (Byte Pair Encoding)** model.
- Trained on "LA RIME, Mo i akorde dann bal zakor", a free-access book.
- Supports special tokens for common NLP tasks:
  - `[CLS]`: Start-of-sequence token for classification tasks.
  - `[SEP]`: Separator token for multi-segment inputs.
  - `[PAD]`: Padding token.
  - `[MASK]`: Masking token used for training masked language models.
  - `[UNK]`: Token for unknown words.

## Usage

### Loading the Tokenizer

You can easily load this tokenizer using the `transformers` library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example: tokenize a sentence
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)
```
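### Inspecting Tokens and Decoding

The token IDs returned by `encode` can be mapped back to subword pieces and decoded to text. The snippet below is a minimal sketch that assumes the tokenizer exposes the standard `transformers` tokenizer interface (`convert_ids_to_tokens`, `decode`, and the special-token attributes listed above).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

text = "Comment i lé zot tout ?"

# encode() returns token IDs; convert them back to see the subword pieces
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))

# Round-trip: decode the IDs back to a string, dropping any special tokens
# (e.g. [CLS]/[SEP]) that the tokenizer may have added
print(tokenizer.decode(ids, skip_special_tokens=True))

# Inspect the special tokens listed in the Features section
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.mask_token, tokenizer.unk_token)
```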