Model Card: 7-Sky/skyopus-pol-rus

This model, 7-Sky/skyopus-pol-rus, is a fine-tuned version of the Helsinki-NLP/opus-mt-sla-sla model, designed specifically for translating text from Polish (pl) to Russian (ru). It is based on the Transformer architecture and uses normalization and SentencePiece tokenization (spm32k) for preprocessing.

Model Details

  • Source Language: Polish (pol)
  • Target Language: Russian (rus)
  • Base Model: Helsinki-NLP/opus-mt-sla-sla
  • Model Type: Transformer
  • Preprocessing: Normalization + SentencePiece (spm32k, spm32k)
  • Language Token: Requires a sentence-initial token in the form >>rus<< to specify the target language.
  • Training Date: 2025-03-10
  • Training Datasets: the model was fine-tuned on a corpus that includes:
    • Medical terminology (e.g., healthcare and clinical texts)
    • Dialogue-based texts (e.g., conversational Polish and Russian)
    • Phraseological units (e.g., idioms and fixed expressions)
    • Slang vocabulary (e.g., informal and colloquial language)
    • Proverbs and sayings (e.g., culturally specific expressions)

This model is part of the broader sla-sla family, originally developed for translations between Slavic languages, but this variant is fine-tuned for the specific pol -> rus pair.
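Because the model inherits the multilingual sla-sla checkpoint, the sentence-initial >>rus<< token must always be prepended to the input. This is a plain string operation; a minimal sketch (the helper name with_target_token is made up for illustration):

```python
def with_target_token(text: str, target: str = "rus") -> str:
    """Prepend the Marian-style sentence-initial target-language token,
    e.g. '>>rus<<', as this model requires."""
    return f">>{target}<< {text}"

print(with_target_token("Dzień dobry!"))  # >>rus<< Dzień dobry!
```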

Benchmarks

  • chrF2 Score: 0.672
  • BLEU Score: 47.6
  • Brevity Penalty: 1.0
  • Reference Length: 59,320 tokens

These metrics reflect the model's performance on the Tatoeba-Challenge dataset for Slavic languages.
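A brevity penalty of 1.0, as reported above, means the system output was at least as long as the 59,320-token reference, so BLEU was not penalized for short output. For context, the standard BLEU brevity penalty can be computed in a few lines (a sketch; the lengths below are illustrative, not benchmark values):

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty: 1.0 when the candidate corpus is
    at least as long as the reference, otherwise exp(1 - r/c)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# Output at least as long as the reference -> no penalty
print(brevity_penalty(60000, 59320))          # 1.0
# A shorter output would be penalized (value below 1.0)
print(round(brevity_penalty(50000, 59320), 3))
```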

How to Use the Model

Below is an example of how to use the model with the transformers library in Python. The code supports generating multiple translation variants using beam search.

from transformers import MarianMTModel, MarianTokenizer

# Model name on Hugging Face Hub
model_name = "7-Sky/skyopus-pol-rus"

# Load the tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Function to translate text from Polish to Russian
def translate_text(source_text, num_translations=3):
    # Add the required language token for Russian
    text_with_token = ">>rus<< " + source_text

    # Tokenize the input text
    inputs = tokenizer(text_with_token, return_tensors="pt", padding=True)

    # Generate translations with multiple variants
    translated_tokens = model.generate(
        **inputs,
        num_return_sequences=num_translations,  # Number of translation variants
        num_beams=num_translations,             # Beam count; must be >= num_return_sequences
        max_length=512                          # Limit output length
    )

    # Decode the translated tokens into readable text
    translations = [tokenizer.decode(tokens, skip_special_tokens=True) for tokens in translated_tokens]
    return translations

# Main loop for text input and translation output
print("Enter a Polish phrase to translate into Russian or !q to quit.")

while True:
    # Get input phrase from the user
    source_text = input("Enter a phrase: ")
    
    # Check for the quit command
    if source_text == "!q":
        print("Exiting the program.")
        break
    
    # Translate the phrase with multiple variants
    translations = translate_text(source_text)
    
    if translations:
        # Output all translation variants
        for idx, translation in enumerate(translations, 1):
            print(f"Variant {idx}: {translation}")

# Example Output:
# Enter a Polish phrase to translate into Russian or !q to quit.
# Enter a phrase: Powiedzieć a zrobić to nie to samo.
# Variant 1: Сказать и сделать — не одно и то же.
# Variant 2: Сказать и сделать — это не одно и то же.
# Variant 3: Сказать и сделать — не то же самое.
#
# Enter a phrase: O jego propozycji nawet nie warto mówić.
# Variant 1: О его предложении даже не стоит говорить.
# Variant 2: О его предложении не стоит даже говорить.
# Variant 3: О его предложении и говорить не стоит.

Dear users and language enthusiasts,

Your support has always been the driving force behind innovation, and today, I’m excited to share how you can help take this project to the next level. Together, we’ve built a unique translation model using Marian, trained on a custom dataset that pushes the boundaries of language understanding. But this is just the beginning!

To continue improving the model, expanding the dataset, and ensuring faster, more accurate translations, we need your help. Your contributions will go directly toward:

Enhancing the dataset: Adding more diverse and high-quality data to make the model even smarter.

Acquiring powerful hardware: Training advanced models requires serious computational power, and your support will help us access the resources needed to make this happen.

Every contribution, no matter how small, brings us closer to a future where language barriers are a thing of the past. If you believe in this mission and want to see this project grow, consider supporting us by clicking the button below to Buy Me a Coffee.

Your support isn’t just a donation—it’s an investment in the future of communication. Let’s build something extraordinary together!

Buy me a coffee at buycoffee.to

Model size: 63.7M parameters (F32, Safetensors)
