---
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-sla-sla
pipeline_tag: translation
language:
- pl
- ru
tags:
- translation
- polish-to-russian
- slavic-languages
---

# Model Card: 7-Sky/skyopus-pol-rus

This model, `7-Sky/skyopus-pol-rus`, is a fine-tuned version of the `Helsinki-NLP/opus-mt-sla-sla` model, designed specifically for translating text from **Polish (pl)** to **Russian (ru)**. It is based on the Transformer architecture and uses normalization and SentencePiece tokenization (spm32k) for preprocessing.

## Model Details

- **Source Language**: Polish (`pol`)
- **Target Language**: Russian (`rus`)
- **Base Model**: [Helsinki-NLP/opus-mt-sla-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla)
- **Model Type**: Transformer
- **Preprocessing**: Normalization + SentencePiece (spm32k for both source and target)
- **Language Token**: Requires a sentence-initial token in the form `>>rus<<` to specify the target language.
- **Training Date**: 2025-03-16
- **Training Datasets**: The model was fine-tuned on a corpus that includes:
  - Medical terminology (e.g., healthcare and clinical texts А-С)
  - Dialogue-based texts (e.g., conversational Polish and Russian)
  - Phraseological units (e.g., idioms and fixed expressions)
  - Slang vocabulary (e.g., informal and colloquial language)
  - Proverbs and sayings (e.g., culturally specific expressions)

This model belongs to the broader `sla-sla` family, which was originally developed for translation between Slavic languages. This variant is fine-tuned for the `pol -> rus` pair only.
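
The sentence-initial token can be added with a small helper before tokenization. The sketch below is only illustrative: the name `add_target_token` is hypothetical and is not part of this model or the `transformers` library. Only the `>>rus<<` token itself comes from this card.

```python
TARGET_TOKEN = ">>rus<<"  # target-language token named in this model card


def add_target_token(sentences, token=TARGET_TOKEN):
    """Prefix each sentence with the target-language token (hypothetical helper).

    Sentences that already start with the token are left unchanged,
    so the prefix is never applied twice.
    """
    prepared = []
    for sentence in sentences:
        text = sentence.strip()
        if not text.startswith(token):
            text = f"{token} {text}"
        prepared.append(text)
    return prepared


batch = add_target_token(["Dzień dobry.", ">>rus<< Dziękuję."])
print(batch)
```

The resulting list can be passed to the tokenizer in one call, which is convenient when translating several sentences at once.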

## Benchmarks

- **chrF2 Score**: 0.672
- **BLEU Score**: 47.6
- **Brevity Penalty**: 1.0
- **Reference Length**: 70,390 tokens

These metrics reflect the model's performance on the Tatoeba-Challenge dataset for Slavic languages.
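
For context on the brevity penalty listed above: in the standard BLEU definition, the penalty is 1.0 whenever the system output is at least as long as the reference, and falls below 1.0 only for shorter output. The following is a minimal sketch of that formula, not the evaluation script used for this model. The function name and the hypothesis lengths are made up for illustration; only the reference length of 70,390 tokens comes from this card.

```python
import math


def brevity_penalty(hypothesis_len, reference_len):
    """Standard BLEU brevity penalty (illustrative helper).

    Returns 1.0 when the hypothesis is at least as long as the reference,
    otherwise exp(1 - reference_len / hypothesis_len).
    """
    if hypothesis_len <= 0:
        return 0.0
    if hypothesis_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / hypothesis_len)


REFERENCE_LEN = 70390  # reference length reported in this card

# Hypothetical output lengths, chosen only to show both branches
bp_full = brevity_penalty(70390, REFERENCE_LEN)
bp_short = brevity_penalty(63351, REFERENCE_LEN)
print(bp_full, bp_short)
```

A reported penalty of 1.0 therefore indicates that the translations were, in total, not shorter than the references.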

## How to Use the Model

Below is an example of how to use the model with the `transformers` library in Python. The code supports generating multiple translation variants using beam search.

```python
from transformers import MarianMTModel, MarianTokenizer

# Model name on Hugging Face Hub
model_name = "7-Sky/skyopus-pol-rus"

# Load the tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)


# Function to translate text from Polish to Russian
def translate_text(source_text, num_translations=3):
    # Add the required language token for Russian
    text_with_token = ">>rus<< " + source_text

    # Tokenize the input text
    inputs = tokenizer(text_with_token, return_tensors="pt", padding=True)

    # Generate translations with multiple variants
    translated_tokens = model.generate(
        **inputs,
        num_return_sequences=num_translations,  # Number of translation variants
        num_beams=num_translations,  # Beam count must be at least num_return_sequences
        max_length=512,  # Limit output length
    )

    # Decode the translated tokens into readable text
    translations = [
        tokenizer.decode(tokens, skip_special_tokens=True)
        for tokens in translated_tokens
    ]
    return translations


# Main loop for text input and translation output
print("Enter a Polish phrase to translate into Russian or !q to quit.")

while True:
    # Get input phrase from the user
    source_text = input("Enter a phrase: ")

    # Check for the quit command
    if source_text == "!q":
        print("Exiting the program.")
        break

    # Translate the phrase with multiple variants
    translations = translate_text(source_text)

    if translations:
        # Output all translation variants
        for idx, translation in enumerate(translations, 1):
            print(f"Variant {idx}: {translation}")

# Example Output:
# Enter a Polish phrase to translate into Russian or !q to quit.
# Enter a phrase: Powiedzieć a zrobić to nie to samo.
# Variant 1: Сказать и сделать — не одно и то же.
# Variant 2: Сказать и сделать — это не одно и то же.
# Variant 3: Сказать и сделать — не то же самое.
#
# Enter a phrase: O jego propozycji nawet nie warto mówić.
# Variant 1: О его предложении даже не стоит говорить.
# Variant 2: О его предложении не стоит даже говорить.
# Variant 3: О его предложении и говорить не стоит.
```
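
Beam search can return at most as many sequences as it has beams, so `num_return_sequences` must not exceed `num_beams`. One way to keep the two values consistent is to derive them together, as in the sketch below. The helper name `beam_search_kwargs` and the default of four beams are assumptions made for illustration; they are not part of this model or of `transformers`.

```python
def beam_search_kwargs(num_translations, min_beams=4, max_length=512):
    """Build keyword arguments for `model.generate` (hypothetical helper).

    The beam count never drops below the number of requested variants;
    `min_beams=4` is an arbitrary illustrative default.
    """
    if num_translations < 1:
        raise ValueError("num_translations must be at least 1")
    return {
        "num_return_sequences": num_translations,
        "num_beams": max(num_translations, min_beams),
        "max_length": max_length,
    }


settings = beam_search_kwargs(3)
print(settings)
```

The returned dictionary can be unpacked into the call, for example `model.generate(**inputs, **beam_search_kwargs(3))`.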

## Dear users and language enthusiasts,

Your support has always been the driving force behind innovation, and today, I'm excited to share how you can help take this project to the next level. Together, we've built a unique translation model using Marian, trained on a custom dataset that pushes the boundaries of language understanding. But this is just the beginning!

To continue improving the model, expanding the dataset, and ensuring faster, more accurate translations, we need your help. Your contributions will go directly toward:

- Enhancing the dataset: Adding more diverse and high-quality data to make the model even smarter.
- Acquiring powerful hardware: Training advanced models requires serious computational power, and your support will help us access the resources needed to make this happen.

Every contribution, no matter how small, brings us closer to a future where language barriers are a thing of the past. If you believe in this mission and want to see this project grow, consider supporting us by clicking the button below to Buy Me a Coffee.

Your support isn't just a donation; it's an investment in the future of communication. Let's build something extraordinary together!

<a href="https://buycoffee.to/skyweb117" target="_blank"><img src="https://buycoffee.to/img/share-button-primary.png" style="width: 166px; height: 43px" alt="Postaw mi kawę na buycoffee.to"></a>