---
library_name: transformers
tags: []
---

PersianGemmaTokenizerFast

A Gemma tokenizer fine-tuned on Persian text, optimized to handle the nuances of the Persian language with improved efficiency and accuracy. The tokenizer is available on the Hugging Face Hub as mshojaei77/PersianGemmaTokenizerFast.

Overview

PersianGemmaTokenizerFast builds on the original Gemma tokenizer and is fine-tuned on Persian data. It is designed to provide faster and more accurate tokenization for Natural Language Processing (NLP) tasks involving Persian text.

Features

  • Optimized for Persian: Tailored tokenization for Persian language constructs.
  • Speed and Efficiency: Built on fast tokenization libraries for quick processing.
  • Compatibility: Works seamlessly with the Hugging Face Transformers library (see the batched example below).
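As a quick illustration of that compatibility, the tokenizer supports the standard batched-call API from Transformers. This is a minimal sketch; the padding behavior shown assumes the tokenizer defines a pad token, as Gemma tokenizers do:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Batched call: pad every sequence in the batch to the longest one
batch = tokenizer(
    ["سلام، حال شما چطور است؟", "خیلی ممنون"],
    padding=True,
)
print(batch["input_ids"])       # one padded list of token IDs per input text
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding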

Usage

Here is an example of how to use the tokenizer in your Python code:

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Example Persian text
text = "سلام، حال شما چطور است؟"

# Tokenize the text
encoded = tokenizer(text)

# Print token IDs and tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

Comparing Performance on a Paragraph of Persian Text

The following figure compares the token counts produced by PersianGemmaTokenizerFast and other tokenizers on a paragraph of Persian text; fewer tokens for the same text indicate a more efficient tokenizer:

[Figure: token counts per tokenizer on a Persian paragraph]
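If you want to reproduce this kind of comparison yourself, a minimal sketch follows. The baseline tokenizer ID (google/gemma-2b) is an assumption, not something specified by this model card; any Gemma checkpoint's tokenizer can serve as the baseline, and gated checkpoints may require authenticating with the Hub first:

from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")
# Assumed baseline: the original Gemma tokenizer (gated; may need `huggingface-cli login`)
base_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

paragraph = "زبان فارسی با خط فارسی نوشته می‌شود و گویشوران آن در ایران، افغانستان و تاجیکستان زندگی می‌کنند."

# Count the tokens each tokenizer produces for the same paragraph
for name, tok in [("PersianGemmaTokenizerFast", persian_tokenizer), ("Gemma (base)", base_tokenizer)]:
    token_count = len(tok(paragraph)["input_ids"])
    print(f"{name}: {token_count} tokens")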

Contributing

Contributions to improve the tokenizer or its documentation are welcome! If you encounter any issues or have suggestions, please feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.