	---
	license: apache-2.0
	base_model: Helsinki-NLP/opus-mt-sla-sla
	pipeline_tag: translation
	language:
	- pl
	- ru
	tags:
	- translation
	- polish-to-russian
	- slavic-languages
	---

	# Model Card: 7-Sky/skyopus-pol-rus

	This model, `7-Sky/skyopus-pol-rus`, is a fine-tuned version of the `Helsinki-NLP/opus-mt-sla-sla` model, designed specifically for translating text from Polish (pl) to Russian (ru). It is based on the Transformer architecture and uses normalization and SentencePiece tokenization (spm32k) for preprocessing.

	## Model Details

	- Source Language: Polish (`pol`)
	- Target Language: Russian (`rus`)
	- Base Model: [Helsinki-NLP/opus-mt-sla-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla)
	- Model Type: Transformer
	- Preprocessing: Normalization + SentencePiece (spm32k for both source and target)
	- Language Token: Requires a sentence-initial token in the form `>>rus<<` to specify the target language.
	- Training Date: 2025-03-16
	- Training Datasets: The model was fine-tuned on a corpus that includes:
	  - Medical terminology (e.g., healthcare and clinical texts А-С)
	  - Dialogue-based texts (e.g., conversational Polish and Russian)
	  - Phraseological units (e.g., idioms and fixed expressions)
	  - Slang vocabulary (e.g., informal and colloquial language)
	  - Proverbs and sayings (e.g., culturally specific expressions)

	This model is part of the broader `sla-sla` family, originally developed for translations between Slavic languages, but this variant is fine-tuned for the specific `pol -> rus` pair.
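Because the target language is selected by a sentence-initial token, it can help to normalize inputs before tokenization. The following is a minimal sketch, not part of the model's own tooling: the helper name `with_target_token` and the choice to replace an existing token rather than keep it are assumptions made for illustration.

```python
import re

# Matches one leading OPUS-MT style language token such as ">>rus<<".
_LANG_TOKEN = re.compile(r"^\s*>>[a-z_A-Z]+<<\s*")


def with_target_token(text, target="rus"):
    """Return `text` with exactly one leading >>target<< token.

    Hypothetical helper: any token already present is replaced,
    so calling it twice does not produce a double prefix.
    """
    body = _LANG_TOKEN.sub("", text).strip()
    if not body:
        raise ValueError("nothing to translate")
    return f">>{target}<< {body}"


print(with_target_token("Dzień dobry"))
print(with_target_token(">>rus<< Dzień dobry"))
```

Both calls print the same string, `>>rus<< Dzień dobry`, which is the form the tokenizer expects.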

	## Benchmarks

	- chrF2 Score: 0.672
	- BLEU Score: 47.6
	- Brevity Penalty: 1.0
	- Reference Length: 70,390 tokens

	These metrics reflect the model's performance on the Tatoeba-Challenge dataset for Slavic languages.
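To make the brevity penalty figure easier to read, here is a small sketch of the standard BLEU brevity penalty. It assumes the reported value follows that standard definition; the card does not say which tool produced it. The hypothesis lengths used below are hypothetical, and only the reference length of 70,390 tokens comes from the list above.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty.

    Returns 1.0 when the system output is at least as long as the
    reference, and exp(1 - ref_len / hyp_len) when it is shorter.
    """
    if hyp_len <= 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


REF_LEN = 70390  # reference length reported in the benchmark list

# Hypothetical output lengths, for illustration only.
for hyp_len in (70390, 65000):
    print(hyp_len, round(brevity_penalty(hyp_len, REF_LEN), 3))
```

Under this definition, a penalty of 1.0 indicates that the translations were, in total, not shorter than the references, so the BLEU score was not reduced for length.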

	## How to Use the Model

	Below is an example of how to use the model with the `transformers` library in Python. The code supports generating multiple translation variants using beam search.

	```python
	from transformers import MarianMTModel, MarianTokenizer

	# Model name on Hugging Face Hub
	model_name = "7-Sky/skyopus-pol-rus"

	# Load the tokenizer and model
	tokenizer = MarianTokenizer.from_pretrained(model_name)
	model = MarianMTModel.from_pretrained(model_name)


	# Function to translate text from Polish to Russian
	def translate_text(source_text, num_translations=3):
	    # Add the required language token for Russian
	    text_with_token = ">>rus<< " + source_text

	    # Tokenize the input text
	    inputs = tokenizer(text_with_token, return_tensors="pt", padding=True)

	    # Generate translations with multiple variants
	    translated_tokens = model.generate(
	        **inputs,
	        num_return_sequences=num_translations,  # Number of translation variants
	        num_beams=num_translations,  # Beam width; must be >= num_return_sequences
	        max_length=512,  # Limit output length
	    )

	    # Decode the translated tokens into readable text
	    return [
	        tokenizer.decode(tokens, skip_special_tokens=True)
	        for tokens in translated_tokens
	    ]


	# Main loop for text input and translation output
	print("Enter a Polish phrase to translate into Russian or !q to quit.")

	while True:
	    # Get input phrase from the user
	    source_text = input("Enter a phrase: ").strip()

	    # Check for the quit command
	    if source_text == "!q":
	        print("Exiting the program.")
	        break

	    # Skip empty input instead of translating a bare language token
	    if not source_text:
	        continue

	    # Translate the phrase and output all translation variants
	    translations = translate_text(source_text)
	    for idx, translation in enumerate(translations, 1):
	        print(f"Variant {idx}: {translation}")

	# Example Output:
	# Enter a Polish phrase to translate into Russian or !q to quit.
	# Enter a phrase: Powiedzieć a zrobić to nie to samo.
	# Variant 1: Сказать и сделать — не одно и то же.
	# Variant 2: Сказать и сделать — это не одно и то же.
	# Variant 3: Сказать и сделать — не то же самое.
	#
	# Enter a phrase: O jego propozycji nawet nie warto mówić.
	# Variant 1: О его предложении даже не стоит говорить.
	# Variant 2: О его предложении не стоит даже говорить.
	# Variant 3: О его предложении и говорить не стоит.
	```
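For non-interactive use, several sentences can be translated in a single `generate` call. The sketch below is one possible way to do that and is not taken from the original card: the function names `translate_batch` and `load_model`, the default beam width of 4, and the use of `truncation=True` are assumptions. The tokenizer and model are passed in as arguments so the batching logic can be exercised without downloading weights.

```python
def translate_batch(texts, tokenizer, model, num_beams=4, max_length=512):
    """Translate a list of Polish sentences to Russian, one output per input."""
    # Prepend the target-language token that the model requires.
    prefixed = [">>rus<< " + text.strip() for text in texts]

    # Pad to the longest sentence in the batch and truncate overlong inputs.
    inputs = tokenizer(prefixed, return_tensors="pt", padding=True, truncation=True)

    # With the default num_return_sequences=1, generate returns one sequence per input.
    outputs = model.generate(**inputs, num_beams=num_beams, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def load_model(model_name="7-Sky/skyopus-pol-rus"):
    """Load (tokenizer, model); transformers is imported only when this is called."""
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model
```

A typical call would be `tokenizer, model = load_model()` followed by `translate_batch(["Dzień dobry.", "Dziękuję bardzo."], tokenizer, model)`, which returns a list of Russian strings in the same order as the inputs.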
	## Dear users and language enthusiasts,

	Your support has always been the driving force behind innovation, and today, I’m excited to share how you can help take this project to the next level. Together, we’ve built a unique translation model using Marian, trained on a custom dataset that pushes the boundaries of language understanding. But this is just the beginning!

	To continue improving the model, expanding the dataset, and ensuring faster, more accurate translations, we need your help. Your contributions will go directly toward:

	- Enhancing the dataset: Adding more diverse and high-quality data to make the model even smarter.
	- Acquiring powerful hardware: Training advanced models requires serious computational power, and your support will help us access the resources needed to make this happen.

	Every contribution, no matter how small, brings us closer to a future where language barriers are a thing of the past. If you believe in this mission and want to see this project grow, consider supporting us by clicking the button below to Buy Me a Coffee.

	Your support isn’t just a donation—it’s an investment in the future of communication. Let’s build something extraordinary together!

	<a href="https://buycoffee.to/skyweb117" target="_blank"><img src="https://buycoffee.to/img/share-button-primary.png" style="width: 166px; height: 43px" alt="Buy me a coffee on buycoffee.to"></a>