|
--- |
|
language: |
|
- he |
|
- en |
|
pipeline_tag: translation
|
tags: |
|
- transformer |
|
- tokenizer |
|
--- |
|
|
|
|
# Model Overview |
|
|
|
**Model Name:** T5 Hebrew-to-English Translation Tokenizer |
|
**Model Type:** Tokenizer for Transformer-based models |
|
**Base Model:** T5 (Text-to-Text Transfer Transformer) |
|
**Preprocessing:** Custom tokenizer built with `SentencePieceBPETokenizer`
|
**Training Data:** Custom Hebrew-English dataset curated for translation tasks |
|
**Intended Use:** Machine translation tasks, specifically Hebrew-to-English translation.
|
|
|
## Model Description |
|
|
|
This tokenizer was trained on a Hebrew-English parallel dataset using `SentencePieceBPETokenizer`. It is optimized for Hebrew text and can be paired with a Transformer model, such as T5, for sequence-to-sequence translation tasks. It handles standard preprocessing steps such as tokenization, padding, and truncation.
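
For reference, a tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. The sketch below is illustrative rather than the exact training script: the corpus path and special tokens are assumptions, while the 30,000-token vocabulary matches the figure reported under Performance.

```python
from tokenizers import SentencePieceBPETokenizer

# Minimal training sketch; the corpus path and special tokens are assumptions
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["hebrew_english_parallel.txt"],      # hypothetical parallel corpus
    vocab_size=30_000,                          # matches the reported vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "</s>", "<unk>"],  # T5-style special tokens
)
tokenizer.save("tokenizer.json")
```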
|
|
|
## Performance |
|
|
|
- **Task:** Hebrew-to-English Translation (Tokenizer only) |
|
- **Dataset:** A custom dataset containing parallel Hebrew-English sentences |
|
- **Metrics:** |
|
- Vocabulary size: 30,000 tokens |
|
  - Tokenization accuracy: not applicable (translation quality is measured on the paired model, not the tokenizer)
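
Once the tokenizer is loaded (see Usage below), the reported vocabulary size can be verified directly:

```python
# Assumes the tokenizer has been loaded as shown in the Usage section
print(tokenizer.vocab_size)  # expected: 30000
```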
|
|
|
## Usage |
|
|
|
### How to Use the Tokenizer |
|
|
|
To use this tokenizer, you can load it using the Hugging Face Transformers library: |
|
|
|
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)

# Example: tokenizing a Hebrew sentence
hebrew_text = "אתהד על החומרה."
inputs = tokenizer(hebrew_text, return_tensors="pt")

print("Token IDs:", inputs["input_ids"])
```
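
Because the tokenizer handles padding and truncation, it can also prepare batches of sentences in one call. A minimal sketch follows; the example sentences and `max_length` are illustrative assumptions:

```python
# Batch tokenization with padding and truncation (values below are illustrative)
batch = tokenizer(
    ["שלום עולם", "מה שלומך היום?"],  # example inputs: "Hello world", "How are you today?"
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print("Batch shape:", batch["input_ids"].shape)
```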
|
|
|
### Example Usage with a Pretrained Model |
|
|
|
To perform translation, you can pair this tokenizer with a pretrained T5 model: |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder; replace with a checkpoint fine-tuned with this tokenizer

# Hebrew text to translate ("Describe the structure of an atom.")
hebrew_text = "תאר את מבנה של אטום."

# Tokenize, then generate a translation (passing the attention mask along with the input IDs)
inputs = tokenizer(hebrew_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)

# Decode the generated token IDs back to text
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translation:", english_translation)
```
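
T5 checkpoints are typically prompted with a task prefix. If the paired model was fine-tuned with one, prepend it before tokenizing; the prefix string below is an assumption about such a hypothetical fine-tune, not a requirement of this tokenizer:

```python
# Hypothetical task prefix; use whatever prefix the paired model was trained with
inputs = tokenizer("translate Hebrew to English: " + hebrew_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```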
|
|
|
## Limitations |
|
|
|
- The tokenizer itself does not perform translation; it must be paired with a translation model. |
|
- Performance depends on the quality of the paired model and training data. |
|
|
|
## License |
|
|
|
This tokenizer is licensed under the Apache 2.0 License. See the LICENSE file for more details. |
|
|
|
|