Spaces:
Runtime error
Runtime error
File size: 10,160 Bytes
8437114 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 |
[[Back]](..)
# S2T Example: Speech Translation (ST) on Multilingual TEDx
[Multilingual TEDx](https://arxiv.org/abs/2102.01757) is multilingual corpus for speech recognition and
speech translation. The data is derived from TEDx talks in 8 source languages
with translations to a subset of 5 target languages.
## Data Preparation
[Download](http://openslr.org/100/) and unpack Multilingual TEDx data to a path
`${MTEDX_ROOT}/${LANG_PAIR}`, then preprocess it with
```bash
# additional Python packages for S2T data processing/model training
pip install pandas torchaudio soundfile sentencepiece
# Generate TSV manifests, features, vocabulary
# and configuration for each language
python examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task asr \
--vocab-type unigram --vocab-size 1000
python examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task st \
--vocab-type unigram --vocab-size 1000
# Add vocabulary and configuration for joint data
# (based on the manifests and features generated above)
python examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task asr --joint \
--vocab-type unigram --vocab-size 8000
python examples/speech_to_text/prep_mtedx_data.py \
--data-root ${MTEDX_ROOT} --task st --joint \
--vocab-type unigram --vocab-size 8000
```
The generated files (manifest, features, vocabulary and data configuration) will be added to
`${MTEDX_ROOT}/${LANG_PAIR}` (per-language data) and `MTEDX_ROOT` (joint data).
## ASR
#### Training
Spanish as example:
```bash
fairseq-train ${MTEDX_ROOT}/es-es \
--config-yaml config_asr.yaml --train-subset train_asr --valid-subset valid_asr \
--save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-epoch 200 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch s2t_transformer_xs --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --dropout 0.3 --label-smoothing 0.1 \
--load-pretrained-encoder-from ${PRETRAINED_ENCODER} \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 --update-freq 8 --patience 10
```
For joint model (using ASR data from all 8 languages):
```bash
fairseq-train ${MTEDX_ROOT} \
--config-yaml config_asr.yaml \
--train-subset train_es-es_asr,train_fr-fr_asr,train_pt-pt_asr,train_it-it_asr,train_ru-ru_asr,train_el-el_asr,train_ar-ar_asr,train_de-de_asr \
--valid-subset valid_es-es_asr,valid_fr-fr_asr,valid_pt-pt_asr,valid_it-it_asr,valid_ru-ru_asr,valid_el-el_asr,valid_ar-ar_asr,valid_de-de_asr \
--save-dir ${MULTILINGUAL_ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-epoch 200 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --dropout 0.3 --label-smoothing 0.1 \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 --update-freq 8 --patience 10 \
--ignore-prefix-size 1
```
where `MULTILINGUAL_ASR_SAVE_DIR` is the checkpoint root path. We set `--update-freq 8` to simulate 8 GPUs
with 1 GPU. You may want to update it accordingly when using more than 1 GPU.
For multilingual models, we prepend target language ID token as target BOS, which should be excluded from
the training loss via `--ignore-prefix-size 1`.
#### Inference & Evaluation
```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
--inputs ${ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
--output "${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${MTEDX_ROOT}/es-es \
--config-yaml config_asr.yaml --gen-subset test --task speech_to_text \
--path ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 \
--skip-invalid-size-inputs-valid-test \
--scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct --remove-bpe
# For models trained on joint data
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
--inputs ${MULTILINGUAL_ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
--output "${MULTILINGUAL_ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}"
for LANG in es fr pt it ru el ar de; do
fairseq-generate ${MTEDX_ROOT} \
--config-yaml config_asr.yaml --gen-subset test_${LANG}-${LANG}_asr --task speech_to_text \
--prefix-size 1 --path ${MULTILINGUAL_ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} \
--max-tokens 40000 --beam 5 \
--skip-invalid-size-inputs-valid-test \
--scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct --remove-bpe
done
```
#### Results
| Data | --arch | Params | Es | Fr | Pt | It | Ru | El | Ar | De |
|--------------|--------------------|--------|------|------|------|------|------|-------|-------|-------|
| Monolingual | s2t_transformer_xs | 10M | 46.4 | 45.6 | 54.8 | 48.0 | 74.7 | 109.5 | 104.4 | 111.1 |
## ST
#### Training
Es-En as example:
```bash
fairseq-train ${MTEDX_ROOT}/es-en \
--config-yaml config_st.yaml --train-subset train_st --valid-subset valid_st \
--save-dir ${ST_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-epoch 200 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch s2t_transformer_xs --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --dropout 0.3 --label-smoothing 0.1 \
--load-pretrained-encoder-from ${PRETRAINED_ENCODER} \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 --update-freq 8 --patience 10
```
For multilingual model (all 12 directions):
```bash
fairseq-train ${MTEDX_ROOT} \
--config-yaml config_st.yaml \
--train-subset train_el-en_st,train_es-en_st,train_es-fr_st,train_es-it_st,train_es-pt_st,train_fr-en_st,train_fr-es_st,train_fr-pt_st,train_it-en_st,train_it-es_st,train_pt-en_st,train_pt-es_st,train_ru-en_st \
--valid-subset valid_el-en_st,valid_es-en_st,valid_es-fr_st,valid_es-it_st,valid_es-pt_st,valid_fr-en_st,valid_fr-es_st,valid_fr-pt_st,valid_it-en_st,valid_it-es_st,valid_pt-en_st,valid_pt-es_st,valid_ru-en_st \
--save-dir ${MULTILINGUAL_ST_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-epoch 200 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --dropout 0.3 --label-smoothing 0.1 \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 --update-freq 8 --patience 10 \
--ignore-prefix-size 1 \
--load-pretrained-encoder-from ${PRETRAINED_ENCODER}
```
where `ST_SAVE_DIR` (`MULTILINGUAL_ST_SAVE_DIR`) is the checkpoint root path. The ST encoder is pre-trained by ASR
for faster training and better performance: `--load-pretrained-encoder-from <(JOINT_)ASR checkpoint path>`. We set
`--update-freq 8` to simulate 8 GPUs with 1 GPU. You may want to update it accordingly when using more than 1 GPU.
For multilingual models, we prepend target language ID token as target BOS, which should be excluded from
the training loss via `--ignore-prefix-size 1`.
#### Inference & Evaluation
Average the last 10 checkpoints and evaluate on the `test` split:
```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
--inputs ${ST_SAVE_DIR} --num-epoch-checkpoints 10 \
--output "${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${MTEDX_ROOT}/es-en \
--config-yaml config_st.yaml --gen-subset test --task speech_to_text \
--path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
--max-tokens 50000 --beam 5 --scoring sacrebleu --remove-bpe
# For multilingual models
python scripts/average_checkpoints.py \
--inputs ${MULTILINGUAL_ST_SAVE_DIR} --num-epoch-checkpoints 10 \
--output "${MULTILINGUAL_ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
for LANGPAIR in es-en es-fr es-pt fr-en fr-es fr-pt pt-en pt-es it-en it-es ru-en el-en; do
fairseq-generate ${MTEDX_ROOT} \
--config-yaml config_st.yaml --gen-subset test_${LANGPAIR}_st --task speech_to_text \
--prefix-size 1 --path ${MULTILINGUAL_ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
--max-tokens 40000 --beam 5 \
--skip-invalid-size-inputs-valid-test \
--scoring sacrebleu --remove-bpe
done
```
For multilingual models, we force decoding from the target language ID token (as BOS) via `--prefix-size 1`.
#### Results
| Data | --arch | Params | Es-En | Es-Pt | Es-Fr | Fr-En | Fr-Es | Fr-Pt | Pt-En | Pt-Es | It-En | It-Es | Ru-En | El-En |
|--------------|--------------------|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Bilingual | s2t_transformer_xs | 10M | 7.0 | 12.2 | 1.7 | 8.9 | 10.6 | 7.9 | 8.1 | 8.7 | 6.4 | 1.0 | 0.7 | 0.6 |
| Multilingual | s2t_transformer_s | 31M | 12.3 | 17.4 | 6.1 | 12.0 | 13.6 | 13.2 | 12.0 | 13.7 | 10.7 | 13.1 | 0.6 | 0.8 |
## Citation
Please cite as:
```
@misc{salesky2021mtedx,
title={Multilingual TEDx Corpus for Speech Recognition and Translation},
author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
year={2021},
}
@inproceedings{wang2020fairseqs2t,
title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
year = {2020},
}
@inproceedings{ott2019fairseq,
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
year = {2019},
}
```
[[Back]](..)
|