guli111 committed · verified
Commit 92f7ee9 · 1 Parent(s): 6ec5be3

Create README.md

Files changed (1):
  1. README.md (+110 -0)

README.md ADDED

---
language:
- en
- ru
- kz

tags:
- translation
- wmt19
- facebook
license: apache-2.0
datasets:
- wmt19
metrics:
- bleu
thumbnail: https://huggingface.co/front/thumbnails/facebook.png
---

# FSMT

## Model description

This is a ported version of the [fairseq wmt19 transformer](https://github.com/pytorch/fairseq/blob/master/examples/wmt19/README.md) for en-ru.

For more details, please see [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616).

The abbreviation FSMT stands for FairSeqMachineTranslation.

All four models are available:

* [wmt19-en-ru](https://huggingface.co/facebook/wmt19-en-ru)
* [wmt19-ru-en](https://huggingface.co/facebook/wmt19-ru-en)
* [wmt19-en-de](https://huggingface.co/facebook/wmt19-en-de)
* [wmt19-de-en](https://huggingface.co/facebook/wmt19-de-en)

## Intended uses & limitations

#### How to use

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

# Load the tokenizer and model for the en-ru pair.
mname = "facebook/wmt19-en-ru"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

# Encode the source sentence, translate, and decode the result.
input_text = "Machine learning is great, isn't it?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # Машинное обучение - это здорово, не так ли?
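
Not part of the original card: a minimal sketch of batched translation with an explicit beam size, for the case where several sentences need to be translated at once. The sentences and `num_beams=5` below are purely illustrative.

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "facebook/wmt19-en-ru"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

sentences = [
    "Machine learning is great, isn't it?",
    "The weather in Moscow is nice today.",
]

# Pad the batch so all sentences can go through generate() in one call.
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, num_beams=5)
for src, tgt in zip(sentences, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{src} -> {tgt}")
```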

#### Limitations and bias

- The original model (and therefore this ported version) doesn't seem to handle inputs with repeated sub-phrases well: [content gets truncated](https://discuss.huggingface.co/t/issues-with-translating-inputs-containing-repeated-phrases/981). A quick way to probe this is sketched below.
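
The following probe is not from the original card; the repeated clause is purely illustrative, and the exact behavior may vary.

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "facebook/wmt19-en-ru"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

# Repeat a short clause and check whether every repetition survives translation.
text = "I like cats. " * 10
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# The repetitions may come back collapsed or truncated, per the linked discussion.
```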

## Training data

Pretrained weights were left identical to the original model released by fairseq. For more details, please see the [paper](https://arxiv.org/abs/1907.06616).

## Eval results

pair  | fairseq (BLEU) | transformers (BLEU)
------|----------------|--------------------
en-ru | [36.4](http://matrix.statmt.org/matrix/output/1914?run_id=6724) | 33.47

The score is slightly below the one reported by `fairseq`, since `transformers` currently doesn't support:

- model ensembles, therefore the best-performing checkpoint was ported (`model4.pt`),
- re-ranking.

The score was calculated using this code:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=en-ru
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=15
mkdir -p $DATA_DIR
sacrebleu -t wmt19 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt19 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/wmt19-$PAIR \
    $DATA_DIR/val.source $SAVE_DIR/test_translations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path $SAVE_DIR/test_bleu.json \
    --bs $BS --task translation --num_beams $NUM_BEAMS
```

Note: fairseq reports using a beam of 50, so you should get a slightly higher score if you re-run with `--num_beams 50`.
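
As a companion sketch only (not part of the original card), the translations written by the script above can be re-scored with sacrebleu's Python API; the paths below assume that script's directory layout.

```python
import sacrebleu

# Hypotheses produced by run_eval.py and the references fetched via `sacrebleu --echo ref`.
with open("data/en-ru/test_translations.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("data/en-ru/val.target", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypothesis list and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```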

## Data Sources

- [training, etc.](http://www.statmt.org/wmt19/)
- [test set](http://matrix.statmt.org/test_sets/newstest2019.tgz?1556572561)

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020},
  title={Facebook FAIR's WMT19 News Translation Task Submission},
  author={Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
  booktitle={Proc. of WMT},
}
```

## TODO

- port model ensemble (fairseq uses 4 model checkpoints)