Update README.md
README.md
CHANGED
@@ -1,63 +1,69 @@
---
license: apache-2.0
base_model:
- Helsinki-NLP/opus-mt-sla-sla
pipeline_tag: translation
language:
- pl
- ru
tags:
- translation
---

* source language(s): pol
* target language(s): rus
* model: transformer
* pre-processing: normalization + SentencePiece (spm32k,spm32k)
* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
* download original weights: [opus-2020-07-27.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.zip)
* test set translations: [opus-2020-07-27.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.test.txt)
* test set scores: [opus-2020-07-27.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.eval.txt)

## Benchmarks

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the
tokenizer = MarianTokenizer.from_pretrained(
model = MarianMTModel.from_pretrained(

# Function to translate text
def translate_text(source_text, num_translations=3):
    # Add the
    text_with_token = ">>rus<< " + source_text

    # Tokenize the input text
    inputs = tokenizer(text_with_token, return_tensors="pt")

    # Generate translations with multiple variants
    translated_tokens = model.generate(
        **inputs,
        num_return_sequences=num_translations,  # Number of translation variants
        num_beams=num_translations
    )

    # Decode the translated tokens into readable text
@@ -65,7 +71,7 @@ def translate_text(source_text, num_translations=3):
    return translations

# Main loop for text input and translation output
print("Enter a phrase to translate or !q to quit.")

while True:
    # Get input phrase from the user
@@ -84,57 +90,17 @@
    for idx, translation in enumerate(translations, 1):
        print(f"Variant {idx}: {translation}")

# Variant 1: >>rus<< О его предложении и говорить не стоит.
```

### System Info:
- hf_name: sla-sla
- source_languages: sla
- target_languages: sla
- opus_readme_url: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla/README.md
- original_repo: Tatoeba-Challenge
- tags: ['translation']
- languages: ['ru', 'pl']
- src_constituents: {'pol'}
- tgt_constituents: {'rus'}
- src_multilingual: True
- tgt_multilingual: True
- prepro: normalization + SentencePiece (spm32k,spm32k)
- url_model: https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.zip
- url_test_set: https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.test.txt
- src_alpha3: sla
- tgt_alpha3: sla
- short_pair: sla-sla
- chrF2_score: 0.672
- bleu: 48.5
- brevity_penalty: 1.0
- ref_len: 59320.0
- src_name: Slavic languages
- tgt_name: Slavic languages
- train_date: 2020-07-27
- src_alpha2: sla
- tgt_alpha2: sla
- prefer_old: False
- long_pair: sla-sla
- helsinki_git_sha: 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535
- transformers_git_sha: 2207e5d8cb224e954a7cba69fa4ac2309e9ff30b
- port_machine: brutasse
- port_time: 2020-08-23-14:41

## Model Card Contact

---
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-sla-sla
pipeline_tag: translation
language:
- pl
- ru
tags:
- translation
- polish-to-russian
- slavic-languages
---

# Model Card: 7-Sky/skyopus-pol-rus

This model, `7-Sky/skyopus-pol-rus`, is a fine-tuned version of the `Helsinki-NLP/opus-mt-sla-sla` model, designed specifically for translating text from **Polish (pl)** to **Russian (ru)**. It is based on the Transformer architecture and uses normalization and SentencePiece tokenization (spm32k) for preprocessing.
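Since the card highlights SentencePiece (spm32k) preprocessing, it can be useful to inspect what the tokenizer produces for a given input. A minimal sketch; the Polish sentence is only an illustrative example:

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("7-Sky/skyopus-pol-rus")

# Show the SentencePiece pieces for a Polish input. The exact pieces depend on
# the spm32k vocabulary shipped with the model; the >>rus<< target-language
# code should surface as a single special token at the start.
pieces = tokenizer.tokenize(">>rus<< Dzień dobry, jak się masz?")
print(pieces)
```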
## Model Details

- **Source Language**: Polish (`pol`)
- **Target Language**: Russian (`rus`)
- **Base Model**: [Helsinki-NLP/opus-mt-sla-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla)
- **Model Type**: Transformer
- **Preprocessing**: Normalization + SentencePiece (spm32k, spm32k)
- **Language Token**: Requires a sentence-initial token in the form `>>rus<<` to specify the target language (see the quick-start sketch below).
- **Training Date**: 2020-07-27

This model is part of the broader `sla-sla` family, originally developed for translation between Slavic languages; this variant is fine-tuned for the specific `pol -> rus` pair.

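For a quick check of the language-token convention described above, here is a minimal sketch using the high-level `transformers` pipeline. The model id is the one from this card; the example sentence is illustrative only:

```python
from transformers import pipeline

# The translation pipeline wraps MarianTokenizer/MarianMTModel, but the
# sentence-initial >>rus<< token still has to be prepended by the caller.
translator = pipeline("translation", model="7-Sky/skyopus-pol-rus")

result = translator(">>rus<< Dzień dobry, jak się masz?", max_length=128)
print(result[0]["translation_text"])
```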
## Benchmarks

- **chrF2 Score**: 0.672
- **BLEU Score**: 48.5
- **Brevity Penalty**: 1.0
- **Reference Length**: 59,320 tokens

These metrics reflect the model's performance on the Tatoeba-Challenge dataset for Slavic languages.

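For reference, BLEU and chrF scores of this kind can be computed with the `sacrebleu` package. A minimal sketch, assuming hypotheses and references are plain-text files with one sentence per line; the file names are hypothetical placeholders:

```python
import sacrebleu

# Hypothetical files: model outputs and reference translations, one sentence per line.
with open("hypotheses.rus.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.rus.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Both metrics take the list of hypotheses plus a list of reference streams,
# hence the extra nesting around `references`.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# Note: depending on the sacrebleu version, chrF is reported on a 0-1 or 0-100 scale.
print(f"BLEU: {bleu.score:.1f}")
print(f"chrF2: {chrf.score}")
```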
## How to Use the Model

Below is an example of how to use the model with the `transformers` library in Python. The code supports generating multiple translation variants using beam search.

```python
from transformers import MarianMTModel, MarianTokenizer

# Model name on Hugging Face Hub
model_name = "7-Sky/skyopus-pol-rus"

# Load the tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Function to translate text from Polish to Russian
def translate_text(source_text, num_translations=3):
    # Add the required language token for Russian
    text_with_token = ">>rus<< " + source_text

    # Tokenize the input text
    inputs = tokenizer(text_with_token, return_tensors="pt", padding=True)

    # Generate translations with multiple variants
    translated_tokens = model.generate(
        **inputs,
        num_return_sequences=num_translations,  # Number of translation variants
        num_beams=num_translations,  # Beam search; must be >= num_return_sequences
        max_length=512  # Limit output length
    )

    # Decode the translated tokens into readable text
    translations = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]
    return translations

# Main loop for text input and translation output
print("Enter a Polish phrase to translate into Russian or !q to quit.")

while True:
    # Get input phrase from the user
    # ... (read the phrase, break on "!q", and call translate_text to get `translations`) ...
    for idx, translation in enumerate(translations, 1):
        print(f"Variant {idx}: {translation}")

# Example Output:
# Enter a Polish phrase to translate into Russian or !q to quit.
# Enter a phrase: Powiedzieć a zrobić to nie to samo.
# Variant 1: Сказать и сделать — не одно и то же.
# Variant 2: Сказать и сделать — это не одно и то же.
# Variant 3: Сказать и сделать — не то же самое.
#
# Enter a phrase: O jego propozycji nawet nie warto mówić.
# Variant 1: О его предложении даже не стоит говорить.
# Variant 2: О его предложении не стоит даже говорить.
# Variant 3: О его предложении и говорить не стоит.
```

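Since the tokenizer call above already uses `padding=True`, several sentences can also be translated in one batch. A minimal sketch; the Polish sentences are illustrative only:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "7-Sky/skyopus-pol-rus"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Each input sentence needs its own sentence-initial >>rus<< token.
sentences = [
    ">>rus<< Dziękuję za pomoc.",
    ">>rus<< Gdzie jest najbliższa stacja metra?",
]

# padding=True pads the batch to the longest sentence.
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, num_beams=4, max_length=512)

for src, translation in zip(sentences, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{src} -> {translation}")
```
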
## Model Card Contact