---
library_name: transformers
tags: []
---
# PersianGemmaTokenizerFast

A Gemma tokenizer fine-tuned on Persian text, optimized to handle the nuances of the Persian language with improved efficiency and accuracy. The tokenizer is available on the Hugging Face Hub as `mshojaei77/PersianGemmaTokenizerFast`.
## Overview

PersianGemmaTokenizerFast builds on the architecture of the original Gemma tokenizer and is fine-tuned on Persian data. It is designed to provide faster and more accurate tokenization for Persian-language Natural Language Processing (NLP) tasks.
## Features
- Optimized for Persian: Tailored tokenization for Persian language constructs.
- Speed and Efficiency: Built on fast tokenization libraries for quick processing.
- Compatibility: Works seamlessly with the Hugging Face Transformers library.
## Usage
Here is an example of how to use the tokenizer in your Python code:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Example Persian text ("Hello, how are you?")
text = "سلام، حال شما چطور است؟"

# Tokenize the text
encoded = tokenizer(text)

# Print token IDs and tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
## Comparing Performance on a Paragraph of Persian Text

The following image compares PersianGemmaTokenizerFast with other tokenizers on a paragraph of Persian text. Fewer tokens for the same text indicate better compression, which translates to shorter sequences and lower processing cost:
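A comparison like the one above can be reproduced by counting the tokens each tokenizer produces for the same text. Below is a minimal sketch; the `compare_token_counts` helper is hypothetical (not part of this project), and the stand-in tokenizers are simple callables so the example runs without downloading models. In practice, replace them with real tokenizers loaded via `AutoTokenizer.from_pretrained`.

```python
# Hypothetical helper for comparing tokenizer efficiency on the same text.
# `tokenizers` maps a display name to any callable that takes a string and
# returns a list of tokens (or token IDs).

def compare_token_counts(tokenizers, text):
    """Return {name: token_count} for each tokenizer applied to `text`."""
    return {name: len(tokenize) for name, tokenize in
            ((name, fn(text)) for name, fn in tokenizers.items())}

if __name__ == "__main__":
    # Stand-in tokenizers for illustration only. A real comparison would use
    # e.g. lambda t: AutoTokenizer.from_pretrained("...")(t)["input_ids"]
    sample = "سلام، حال شما چطور است؟"
    candidates = {
        "whitespace": lambda t: t.split(),
        "character": lambda t: list(t),
    }
    for name, count in compare_token_counts(candidates, sample).items():
        print(f"{name}: {count} tokens")
```

A lower count for a given tokenizer means the same paragraph fits in fewer tokens, which is the metric the comparison image reports.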
## Contributing
Contributions to improve the tokenizer or its documentation are welcome. If you encounter an issue or have a suggestion, please open an issue or submit a pull request on the Hub repository.
## License
This project is licensed under the MIT License.