Tokenizer for Réunion Creole 🇷🇪

This tokenizer is designed specifically for Réunion Creole, a French-based creole spoken primarily on Réunion Island. It is built on the Byte Pair Encoding (BPE) model and optimized for the lexical and orthographic particularities of the language.

Features

  • Built using the BPE (Byte Pair Encoding) model.
  • Trained on "LA RIME, Mo i akorde dann bal zakor", a free-access book.
  • Supports special tokens for common NLP tasks:
    • [CLS]: Start-of-sequence token for classification tasks.
    • [SEP]: Separator token for multi-segment inputs.
    • [PAD]: Padding token.
    • [MASK]: Masking token used for training masked language models.
    • [UNK]: Token for unknown words.
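To illustrate how these special tokens frame a model input, here is a minimal, hypothetical sketch (plain Python, not the tokenizer's actual internals): `[CLS]` opens the sequence, `[SEP]` closes each segment, and `[PAD]` fills the remainder up to a fixed length. The function name and `max_len` value are illustrative only.

```python
# Hypothetical illustration of special-token framing (not the real tokenizer).
def build_input(segment_a, segment_b=None, max_len=12):
    # [CLS] marks the start of the sequence for classification tasks.
    tokens = ["[CLS]"] + segment_a + ["[SEP]"]
    # A second segment, if present, is closed by its own [SEP].
    if segment_b:
        tokens += segment_b + ["[SEP]"]
    # [PAD] fills the sequence up to the fixed length.
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens

print(build_input(["Comment", "i", "lé"], ["zot", "tout"]))
```

In practice the `transformers` tokenizer inserts these tokens for you when encoding.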

Usage

Loading the Tokenizer

You can easily load this tokenizer using the transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hugohow/creole_reunion_tokenizer")

# Example of tokenization
text = "Comment i lé zot tout ?"
tokens = tokenizer.encode(text)
print(tokens)
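For readers curious how a BPE tokenizer of this kind is produced, below is a minimal sketch using the Hugging Face `tokenizers` library. The corpus line and vocabulary size are placeholders, not the actual training setup used for this tokenizer (which was trained on the book cited above).

```python
# Sketch of training a BPE tokenizer with the Hugging Face `tokenizers`
# library; corpus and vocab_size are placeholders, not the real setup.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "Comment i lé zot tout ?",  # sample sentence; the real corpus is the book
]

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=1000,  # placeholder value
)
tok.train_from_iterator(corpus, trainer=trainer)

print(tok.encode("Comment i lé").tokens)
```

The trained tokenizer can then be wrapped for use with `transformers` and pushed to the Hub.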

Author: Hugo How-Choong
