7-Sky commited on
Commit
e993bf2
·
verified ·
1 Parent(s): 4143a94

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +48 -82
README.md CHANGED
@@ -1,63 +1,69 @@
1
  ---
2
  license: apache-2.0
3
- base_model:
4
- - Helsinki-NLP/opus-mt-sla-sla
5
  pipeline_tag: translation
6
  language:
7
- - pl
8
- - ru
9
-
10
-
11
  tags:
12
- - translation
 
 
13
  ---
14
 
15
- ### sla-sla
 
 
 
 
16
 
17
- * source group: Slavic languages
18
- * target group: Slavic languages
19
- * OPUS readme: [sla-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla/README.md)
 
 
 
 
20
 
21
- * model: transformer
22
- * source language(s): pol
23
- * target language(s): rus
24
- * model: transformer
25
- * pre-processing: normalization + SentencePiece (spm32k,spm32k)
26
- * a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)
27
- * download original weights: [opus-2020-07-27.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.zip)
28
- * test set translations: [opus-2020-07-27.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.test.txt)
29
- * test set scores: [opus-2020-07-27.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.eval.txt)
30
 
31
  ## Benchmarks
32
 
 
 
 
 
33
 
 
34
 
 
35
 
36
- ## Run the model
37
 
38
  ```python
39
  from transformers import MarianMTModel, MarianTokenizer
40
 
41
- # Paths to the model and tokenizer
42
- model_path = "7-Sky/skyopus-pol-rus"
43
 
44
- # Load the model and tokenizer
45
- tokenizer = MarianTokenizer.from_pretrained(model_path)
46
- model = MarianMTModel.from_pretrained(model_path)
47
 
48
- # Function to translate text with multiple variants (Russian only)
49
  def translate_text(source_text, num_translations=3):
50
- # Add the fixed language token for Russian
51
  text_with_token = ">>rus<< " + source_text
52
 
53
  # Tokenize the input text
54
- inputs = tokenizer(text_with_token, return_tensors="pt")
55
 
56
  # Generate translations with multiple variants
57
  translated_tokens = model.generate(
58
  **inputs,
59
  num_return_sequences=num_translations, # Number of translation variants
60
- num_beams=num_translations # Use multiple beams for diversity
 
61
  )
62
 
63
  # Decode the translated tokens into readable text
@@ -65,7 +71,7 @@ def translate_text(source_text, num_translations=3):
65
  return translations
66
 
67
  # Main loop for text input and translation output
68
- print("Enter a phrase to translate or !q to quit.")
69
 
70
  while True:
71
  # Get input phrase from the user
@@ -84,57 +90,17 @@ while True:
84
  for idx, translation in enumerate(translations, 1):
85
  print(f"Variant {idx}: {translation}")
86
 
87
- # Output
88
- # Enter a phrase to translate or !q to quit.
89
- # Enter a phrase: Powiedzieć a zrobić to nie to samo.
90
- # Variant 1: >>rus<< Сказать и сделать - не одно и то же.
91
- # Variant 2: >>rus<< Сказать и сделать — не одно и то же.
92
- # Variant 3: >>rus<< Сказать и сделать - не то же самое.
93
-
94
- # Enter a phrase to translate or !q to quit.
95
- # Enter a phrase: O jego propozycji nawet nie warto mówić.
96
- # Variant 1: >>rus<< О его предложении не стоит даже говорить.
97
- # Variant 1: >>rus<< О его предложении даже не стоит говорить.
98
- # Variant 1: >>rus<< О его предложении и говорить не стоит.
99
-
100
- ```
101
-
102
-
103
-
104
- ### System Info:
105
- - hf_name: sla-sla
106
-
107
- - source_languages: sla
108
- - target_languages: sla
109
- - opus_readme_url: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla/README.md
110
- - original_repo: Tatoeba-Challenge
111
- - tags: ['translation']
112
- - languages: ['ru', 'pl']
113
- - src_constituents: { 'pol'}
114
- - tgt_constituents: {'rus'}
115
- - src_multilingual: True
116
- - tgt_multilingual: True
117
- - prepro: normalization + SentencePiece (spm32k,spm32k)
118
- - url_model: https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.zip
119
- - url_test_set: https://object.pouta.csc.fi/Tatoeba-MT-models/sla-sla/opus-2020-07-27.test.txt
120
- - src_alpha3: sla
121
- - tgt_alpha3: sla
122
- - short_pair: sla-sla
123
- - chrF2_score: 0.672
124
- - bleu: 48.5
125
- - brevity_penalty: 1.0
126
- - ref_len: 59320.0
127
- - src_name: Slavic languages
128
- - tgt_name: Slavic languages
129
- - train_date: 2020-07-27
130
- - src_alpha2: sla
131
- - tgt_alpha2: sla
132
- - prefer_old: False
133
- - long_pair: sla-sla
134
- - helsinki_git_sha: 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535
135
- - transformers_git_sha: 2207e5d8cb224e954a7cba69fa4ac2309e9ff30b
136
- - port_machine: brutasse
137
- - port_time: 2020-08-23-14:41
138
 
139
 
140
  ## Model Card Contact
 
1
  ---
2
  license: apache-2.0
3
+ base_model: Helsinki-NLP/opus-mt-sla-sla
 
4
  pipeline_tag: translation
5
  language:
6
+ - pl
7
+ - ru
 
 
8
  tags:
9
+ - translation
10
+ - polish-to-russian
11
+ - slavic-languages
12
  ---
13
 
14
+ # Model Card: 7-Sky/skyopus-pol-rus
15
+
16
+ This model, `7-Sky/skyopus-pol-rus`, is a fine-tuned version of the `Helsinki-NLP/opus-mt-sla-sla` model, designed specifically for translating text from **Polish (pl)** to **Russian (ru)**. It is based on the Transformer architecture and uses normalization and SentencePiece tokenization (spm32k) for preprocessing.
17
+
18
+ ## Model Details
19
 
20
+ - **Source Language**: Polish (`pol`)
21
+ - **Target Language**: Russian (`rus`)
22
+ - **Base Model**: [Helsinki-NLP/opus-mt-sla-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/sla-sla)
23
+ - **Model Type**: Transformer
24
+ - **Preprocessing**: Normalization + SentencePiece (spm32k, spm32k)
25
+ - **Language Token**: Requires a sentence-initial token in the form `>>rus<<` to specify the target language.
26
+ - **Training Date**: 2020-07-27
27
 
28
+ This model is part of the broader `sla-sla` family, originally developed for translations between Slavic languages, but this variant is fine-tuned for the specific `pol -> rus` pair.
 
 
 
 
 
 
 
 
29
 
30
  ## Benchmarks
31
 
32
+ - **chrF2 Score**: 0.672
33
+ - **BLEU Score**: 48.5
34
+ - **Brevity Penalty**: 1.0
35
+ - **Reference Length**: 59,320 tokens
36
 
37
+ These metrics reflect the model's performance on the Tatoeba-Challenge dataset for Slavic languages.
38
 
39
+ ## How to Use the Model
40
 
41
+ Below is an example of how to use the model with the `transformers` library in Python. The code supports generating multiple translation variants using beam search.
42
 
43
  ```python
44
  from transformers import MarianMTModel, MarianTokenizer
45
 
46
+ # Model name on Hugging Face Hub
47
+ model_name = "7-Sky/skyopus-pol-rus"
48
 
49
+ # Load the tokenizer and model
50
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
51
+ model = MarianMTModel.from_pretrained(model_name)
52
 
53
+ # Function to translate text from Polish to Russian
54
  def translate_text(source_text, num_translations=3):
55
+ # Add the required language token for Russian
56
  text_with_token = ">>rus<< " + source_text
57
 
58
  # Tokenize the input text
59
+ inputs = tokenizer(text_with_token, return_tensors="pt", padding=True)
60
 
61
  # Generate translations with multiple variants
62
  translated_tokens = model.generate(
63
  **inputs,
64
  num_return_sequences=num_translations, # Number of translation variants
65
+ num_beams=num_translations, # Use beams for diversity
66
+ max_length=512 # Limit output length
67
  )
68
 
69
  # Decode the translated tokens into readable text
 
71
  return translations
72
 
73
  # Main loop for text input and translation output
74
+ print("Enter a Polish phrase to translate into Russian or !q to quit.")
75
 
76
  while True:
77
  # Get input phrase from the user
 
90
  for idx, translation in enumerate(translations, 1):
91
  print(f"Variant {idx}: {translation}")
92
 
93
+ # Example Output:
94
+ # Enter a Polish phrase to translate into Russian or !q to quit.
95
+ # Enter a phrase: Powiedzieć a zrobić to nie to samo.
96
+ # Variant 1: Сказать и сделать не одно и то же.
97
+ # Variant 2: Сказать и сделать — это не одно и то же.
98
+ # Variant 3: Сказать и сделать не то же самое.
99
+ #
100
+ # Enter a phrase: O jego propozycji nawet nie warto mówić.
101
+ # Variant 1: О его предложении даже не стоит говорить.
102
+ # Variant 2: О его предложении не стоит даже говорить.
103
+ # Variant 3: О его предложении и говорить не стоит.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
 
105
 
106
  ## Model Card Contact