Mana Tokenizer

The Mana Tokenizer is a custom-trained BPE tokenizer designed for Persian text. It is trained on a large combined Persian corpus and uses byte-pair encoding (BPE) with high character coverage to handle diverse Persian text.

Quick Start

You can encode and decode text with the Mana Tokenizer like this:

from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))                     # list of token IDs
print(tokenizer.decode(tokenizer.encode(text)))   # round-trips back to the original text

For comparison, this is the raw UTF-8 byte encoding of the same text (one integer per byte):

[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
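
For reference, the byte list above is simply the UTF-8 encoding of the sentence and can be reproduced with plain Python:

text = "سلام من یک متن تست برای تست این تست هستم."
raw_bytes = list(text.encode("utf-8"))  # one integer per UTF-8 byte
print(raw_bytes)       # matches the 72-element list above
print(len(raw_bytes))  # 72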

and here is what the Mana Tokenizer generates:

[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
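
So the same sentence drops from 72 UTF-8 bytes to 11 Mana tokens. You can check the ratio yourself:

text = "سلام من یک متن تست برای تست این تست هستم."
tokens = tokenizer.encode(text)
num_bytes = len(text.encode("utf-8"))
print(f"{num_bytes} bytes -> {len(tokens)} tokens (~{num_bytes / len(tokens):.1f} bytes per token)")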

You can also add special tokens:

tokenizer.register_special_tokens({"</s>": 100269})
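
Registered special tokens get the IDs you assign. Below is a minimal sketch, assuming the decode method resolves registered IDs back to their strings (as in minbpe-style tokenizers); check the actual behavior before relying on it:

tokenizer.register_special_tokens({"</s>": 100269})
ids = tokenizer.encode("یک متن کوتاه") + [100269]  # append the end-of-sequence marker manually
print(tokenizer.decode(ids))  # should end with "</s>" if decode handles registered IDs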

Batch encode:

tokenizer.batch_encode(["یک متن طولانی"])
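
batch_encode takes a list of strings. Assuming it returns one token list per input string (an assumption about the return type), you can inspect the results like this:

texts = ["یک متن طولانی", "سلام من یک متن تست برای تست این تست هستم."]
batch = tokenizer.batch_encode(texts)  # assumed: one list of token IDs per input string
for text, tokens in zip(texts, batch):
    print(len(tokens), "tokens for:", text)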

Benchmark

  • Benchmark DateTime: 2024-11-06 16:12:50
  • Mana Batch Encode Time: ≈0.107 seconds
  • Mana Batch Encode Memory Usage: ≈13.2 KB
  • Total characters in benchmark: 131,000
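
The figures above come from the project's own benchmark run. A minimal sketch of taking a comparable timing and memory measurement with only the standard library (not the original benchmark script) could look like this:

import time
import tracemalloc
from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
texts = ["سلام من یک متن تست برای تست این تست هستم."] * 1000  # substitute your own Persian samples

tracemalloc.start()
start = time.perf_counter()
tokenizer.batch_encode(texts)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"batch encode time: {elapsed:.4f} s, peak memory: {peak / 1024:.1f} KB")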

Special Tokens

  • user Token: <|user|>
  • assistant Token: <|assistant|>
  • end Token: <|end|>
  • system Token: <|system|>
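
These follow a common chat-template pattern. Here is a sketch of assembling a chat-style prompt with them (whether the tokens ship pre-registered or must be added via register_special_tokens is not specified here):

prompt = (
    "<|system|>تو یک دستیار مفید هستی.<|end|>"
    "<|user|>سلام<|end|>"
    "<|assistant|>"
)
print(tokenizer.encode(prompt))  # special-token handling depends on the tokenizer configuration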

Statistics

  • Model Type: BPE
  • Vocabulary Size: 265,703
  • Character Coverage: 99.9%
  • Total Number of Text Samples: 1,147,036
  • Total Number of Tokens: 1,490,338
  • Average Token Length: 4.51
  • Corpus Size (in bytes): 1,792,210,410

Training Details

  • Training Data: Mana Persian corpus
  • Training Script: Mana Trainer
  • Script Version: 1.2

License

Mana tokenizer is licensed under the MIT License.
