|
--- |
|
language: |
|
- he |
|
- en |
|
pipeline_tag: translation
|
tags: |
|
- transformer |
|
- tokenizer |
|
--- |
|
|
|
|
# Model Overview |
|
|
|
**Model Name:** T5 Hebrew-to-English Translation Tokenizer |
|
**Model Type:** Tokenizer for Transformer-based models |
|
**Base Model:** T5 (Text-to-Text Transfer Transformer) |
|
**Preprocessing:** Custom tokenizer built with `SentencePieceBPETokenizer`
|
**Training Data:** Custom Hebrew-English dataset curated for translation tasks |
|
**Intended Use:** Machine translation tasks, specifically Hebrew-to-English translation.
|
|
|
## Model Description |
|
|
|
This tokenizer was trained on a Hebrew-English parallel dataset using `SentencePieceBPETokenizer`. It is optimized for Hebrew text and can be paired with a Transformer model, such as T5, for sequence-to-sequence translation tasks. It handles standard preprocessing steps such as tokenization, padding, and truncation.
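
For reference, a tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. The sketch below is illustrative rather than the exact training script: the corpus path and special tokens are assumptions, while the 30,000-token vocabulary matches the figure reported under Performance.

```python
from tokenizers import SentencePieceBPETokenizer

# Minimal training sketch; the corpus path and special tokens are assumptions
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["hebrew_english_parallel.txt"],      # hypothetical parallel corpus
    vocab_size=30_000,                          # matches the reported vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "</s>", "<unk>"],  # T5-style special tokens
)
tokenizer.save("tokenizer.json")
```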
|
|
|
## Performance |
|
|
|
- **Task:** Hebrew-to-English Translation (Tokenizer only) |
|
- **Dataset:** A custom dataset containing parallel Hebrew-English sentences |
|
- **Metrics:** |
|
- Vocabulary size: 30,000 tokens |
|
  - Tokenization accuracy: not applicable (translation quality is measured on the paired model, not the tokenizer)
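
Once the tokenizer is loaded (see Usage below), the reported vocabulary size can be verified directly:

```python
# Assumes the tokenizer has been loaded as shown in the Usage section
print(tokenizer.vocab_size)  # expected: 30000
```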
|
|
|
## Usage |
|
|
|
### How to Use the Tokenizer |
|
|
|
To use this tokenizer, you can load it using the Hugging Face Transformers library: |
|
|
|
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)

# Example: tokenizing a Hebrew sentence
hebrew_text = "אתהד על החומרה."
inputs = tokenizer(hebrew_text, return_tensors="pt")

print("Token IDs:", inputs["input_ids"])
```
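
Because the tokenizer handles padding and truncation, it can also prepare batches of sentences in one call. A minimal sketch follows; the example sentences and `max_length` are illustrative assumptions:

```python
# Batch tokenization with padding and truncation (values below are illustrative)
batch = tokenizer(
    ["שלום עולם", "מה שלומך היום?"],  # example inputs: "Hello world", "How are you today?"
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print("Batch shape:", batch["input_ids"].shape)
```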
|
|
|
### Example Usage with a Pretrained Model |
|
|
|
To perform translation, you can pair this tokenizer with a pretrained T5 model: |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder; replace with a checkpoint fine-tuned with this tokenizer

# Hebrew text to translate ("Describe the structure of an atom.")
hebrew_text = "תאר את מבנה של אטום."

# Tokenize, then generate a translation (passing the attention mask along with the input IDs)
inputs = tokenizer(hebrew_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)

# Decode the generated token IDs back to text
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translation:", english_translation)
```
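
T5 checkpoints are typically prompted with a task prefix. If the paired model was fine-tuned with one, prepend it before tokenizing; the prefix string below is an assumption about such a hypothetical fine-tune, not a requirement of this tokenizer:

```python
# Hypothetical task prefix; use whatever prefix the paired model was trained with
inputs = tokenizer("translate Hebrew to English: " + hebrew_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```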
|
|
|
## Limitations |
|
|
|
- The tokenizer itself does not perform translation; it must be paired with a translation model. |
|
- Performance depends on the quality of the paired model and training data. |
|
|
|
## License |
|
|
|
This tokenizer is licensed under the Apache 2.0 License. See the LICENSE file for more details. |
|
|
|
|