---
library_name: transformers
tags: []
---

# PersianGemmaTokenizerFast

A Gemma tokenizer fine-tuned on Persian text, designed to handle the nuances of the Persian language with improved efficiency and accuracy. It is available on the Hugging Face Hub as [mshojaei77/PersianGemmaTokenizerFast](https://huggingface.co/mshojaei77/PersianGemmaTokenizerFast).

## Overview

**PersianGemmaTokenizerFast** builds on the architecture of the original Gemma tokenizer and is fine-tuned on Persian data. It provides faster and more accurate tokenization for Natural Language Processing (NLP) tasks involving Persian text.

## Features

- **Optimized for Persian:** Tokenization tailored to Persian language constructs.
- **Speed and efficiency:** Built on Hugging Face's fast, Rust-backed tokenizer implementation for quick processing.
- **Compatibility:** Works seamlessly with the Hugging Face Transformers library.

## Usage

Here is an example of how to use the tokenizer in Python:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianGemmaTokenizerFast")

# Example Persian text ("Hello, how are you?")
text = "سلام، حال شما چطور است؟"

# Tokenize the text
encoded = tokenizer(text)

# Print the token IDs and the corresponding tokens
print("Token IDs:", encoded["input_ids"])
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

## Comparing Performance on a Paragraph of Persian Text

The following image compares PersianGemmaTokenizerFast against other tokenizers on a paragraph of Persian text; fewer tokens for the same text indicate a more efficient tokenizer for Persian:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6556b1bb85d43542fa1a8f91/lZJKqsi4BZ8mJiY_I-vhA.png)

## Contributing

Contributions to improve the tokenizer or its documentation are welcome. If you encounter an issue or have a suggestion, please open an issue or submit a pull request.

## License

This project is licensed under the MIT License.
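
## Reproducing the Token-Count Comparison

As a rough sketch of how a comparison like the one pictured above can be reproduced, the snippet below counts the tokens each tokenizer produces for the same Persian paragraph. The base-tokenizer ID `google/gemma-2b` and the sample sentence are illustrative assumptions, not part of this model card's benchmark; note that the Gemma repositories are gated, so you may need to accept their license on the Hub and authenticate before downloading.

```python
from transformers import AutoTokenizer

# Sample Persian paragraph to compare on (any Persian text works):
# "Persian is one of the Indo-European languages spoken in Iran,
#  Afghanistan, and Tajikistan."
text = "زبان فارسی یکی از زبان‌های هندواروپایی است که در ایران، افغانستان و تاجیکستان صحبت می‌شود."

# Tokenizer IDs to compare; "google/gemma-2b" is an assumed baseline here,
# and its repo is gated (requires accepting the license on the Hub).
tokenizer_ids = [
    "mshojaei77/PersianGemmaTokenizerFast",
    "google/gemma-2b",
]

for tokenizer_id in tokenizer_ids:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    # Skip special tokens so only the tokens produced for the text itself are counted
    token_count = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    print(f"{tokenizer_id}: {token_count} tokens")
```

A lower count means the tokenizer packs the same Persian text into fewer pieces, which generally translates into shorter sequences and cheaper inference for downstream models.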