---
library_name: transformers
language:
- rcf
---

# Tokenizer for Réunion Creole 🇷🇪

This tokenizer, created by Hugo How-Choong, is specifically designed for working with **Réunion Creole**, a language primarily spoken on the island of Réunion. It is based on the **Byte Pair Encoding (BPE)** model and optimized for the lexical and orthographic specificities of the language.

## Features

- Built using the **BPE (Byte Pair Encoding)** model.
- Trained on "LA RIME, Mo i akorde dann bal zakor", a free-access book.
- Supports special tokens for common NLP tasks:
  - `[CLS]`: Start-of-sequence token for classification tasks.
  - `[SEP]`: Separator token for multi-segment inputs.
  - `[PAD]`: Padding token.
  - `[MASK]`: Masking token used for training masked language models.
  - `[UNK]`: Token for unknown words.

## Usage

### Loading the Tokenizer

You can easily load this tokenizer using the `transformers` library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example: tokenize a sentence
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)
```
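### Inspecting Tokens and Decoding

The token IDs returned by `encode` can be mapped back to subword pieces and decoded to text. The snippet below is a minimal sketch that assumes the tokenizer exposes the standard `transformers` tokenizer interface (`convert_ids_to_tokens`, `decode`, and the special-token attributes listed above).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

text = "Comment i lé zot tout ?"

# encode() returns token IDs; convert them back to see the subword pieces
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))

# Round-trip: decode the IDs back to a string, dropping any special tokens
# (e.g. [CLS]/[SEP]) that the tokenizer may have added
print(tokenizer.decode(ids, skip_special_tokens=True))

# Inspect the special tokens listed in the Features section
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.mask_token, tokenizer.unk_token)
```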