license: apache-2.0

Data

Pretraining Data

The training corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data, covering all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese. In total, it contains 3,157,965,012 parallel sentence pairs.

This highly multilingual corpus is composed predominantly of data sourced from OPUS, with additional data taken from the NTEU project and Projecte Aina’s existing corpora. Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using Projecte Aina’s Spanish-Catalan model. The final distribution of languages was as below:
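As an aside on the synthetic-data step mentioned above, the idea can be sketched as follows: take each Spanish <-> xx pair, translate the Spanish side into Catalan, and keep the xx side unchanged to obtain a Catalan <-> xx pair. This is only a minimal illustration under assumptions, not the pipeline actually used. The function name `synthesize_catalan_pairs`, the `translate_es_to_ca` callable, the batch size, and the toy lookup translator are all hypothetical stand-ins; in practice the callable would wrap Projecte Aina’s Spanish-Catalan model.

```python
from typing import Callable, Iterable, List, Tuple

# (source sentence, target sentence) in some language pair
SentencePair = Tuple[str, str]


def synthesize_catalan_pairs(
    es_xx_pairs: Iterable[SentencePair],
    translate_es_to_ca: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[SentencePair]:
    """Build synthetic Catalan<->xx pairs from Spanish<->xx pairs.

    `translate_es_to_ca` is a hypothetical batch translator: it takes a
    list of Spanish sentences and returns the Catalan translations in
    the same order.
    """
    pairs = list(es_xx_pairs)
    synthetic: List[SentencePair] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        ca_sentences = translate_es_to_ca([es for es, _ in batch])
        if len(ca_sentences) != len(batch):
            raise ValueError("translator must return one output per input")
        # Pair each Catalan translation with the original xx sentence.
        synthetic.extend((ca, xx) for ca, (_, xx) in zip(ca_sentences, batch))
    return synthetic


# Toy stand-in for a real Spanish-Catalan model (assumption for illustration).
def toy_translate(batch: List[str]) -> List[str]:
    lookup = {"Buenos días": "Bon dia", "Gracias": "Gràcies"}
    return [lookup[sentence] for sentence in batch]


es_de_pairs = [("Buenos días", "Guten Morgen"), ("Gracias", "Danke")]
print(synthesize_catalan_pairs(es_de_pairs, toy_translate))
# [('Bon dia', 'Guten Morgen'), ('Gràcies', 'Danke')]
```

Batching is used here only because translation models are usually called on batches; the pairing logic itself is independent of the batch size.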

Instruction Tuning Data

We include machine translation-related tasks but do not include chat data.

The full list of tasks included in the training data is given below.

Data Sources

| Task | Source | Languages | Count |
|---|---|---|---|
| Chat | N/A | N/A | 0 |
| Multi-reference Translation | TowerBlocks | mixed | 10000 |
| Paraphrase | TowerBlocks | mixed | 3521 |
| Named-entity Recognition | AnCora-Ca-NER | ca | 12059 |
| Named-entity Recognition | BasqueGLUE, EusIE | eu | 4304 |
| Named-entity Recognition | SLI NERC Galician Gold Corpus | gl | 6483 |
| Named-entity Recognition | TowerBlocks | pt | 854 |
| Named-entity Recognition | TowerBlocks | nl | 800 |
| Named-entity Recognition | TowerBlocks | es | 1654 |
| Named-entity Recognition | TowerBlocks | en | 1671 |
| Named-entity Recognition | TowerBlocks | ru | 800 |
| Named-entity Recognition | TowerBlocks | it | 858 |
| Named-entity Recognition | TowerBlocks | fr | 857 |
| Named-entity Recognition | TowerBlocks | de | 1312 |
| Terminology-aware Translation | TowerBlocks | en-ru | 50 |
| Terminology-aware Translation | TowerBlocks | en-fr | 29 |
| Automatic Post Edition | TowerBlocks | en-fr | 6133 |
| Automatic Post Edition | TowerBlocks | en-nl | 9077 |
| Automatic Post Edition | TowerBlocks | en-pt | 5762 |
| Automatic Post Edition | TowerBlocks | de-en | 10000 |
| Automatic Post Edition | TowerBlocks | en-de | 10000 |
| Machine Translation Evaluation | TowerBlocks-sample | en-ru, en-pl, ru-en, en-de, en-ru, de-fr, de-en, en-de | 353 |
| Machine Translation Evaluation | BSC | four pivot languages (eu, es, ca, gl) paired with European languages (bg, cs, da, de, el, en, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv) | 9700 |
| General Machine Translation | TowerBlocks | nl-en, en-ru, it-en, fr-en, es-en, en-fr, ru-en, fr-de, en-nl, de-fr | 500 |
| General Machine Translation | BSC | three pivot languages (es, ca, en) paired with European languages (ast, arn, arg, bg, cs, cy, da, de, el, et, fi, ga, gl, hr, it, lt, lv, mt, nb, nn, nl, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk, eu) | 9350 |
| Fill-in-the-Blank | BSC | European languages (cs, da, de, el, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv), pivot in ca, es, eu, gl, en | 11500 |
| Document-level Translation | BSC | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
| Paragraph-level Translation | BSC | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
| Contextual Machine Translation | TowerBlocks | en-it | 348 |
| Contextual Machine Translation | TowerBlocks | en-ru | 454 |
| Contextual Machine Translation | TowerBlocks | en-fr | 369 |
| Contextual Machine Translation | TowerBlocks | en-nl | 417 |
| Contextual Machine Translation | TowerBlocks | en-es | 431 |
| Contextual Machine Translation | TowerBlocks | en-de | 558 |