xixianliao committed on
Commit ffe9a2e · verified · 1 Parent(s): 23d391a

Update README.md

Files changed (1)
  1. README.md +71 -3
README.md CHANGED
@@ -1,3 +1,71 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ ## Data
+
+ ### Pretraining Data
+
+ The pretraining corpus consists of 70 billion tokens of Catalan- and Spanish-centric parallel data covering all of the official European languages plus Catalan, Basque,
+ Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012 parallel sentence pairs.
+
+ This highly multilingual corpus is composed predominantly of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/) and Projecte Aina’s existing corpora.
+ Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
+ [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages is shown below:
+
+ ![](./treemap.png)
+
+ ### Instruction Tuning Data
+
+ The instruction tuning data covers machine translation-related tasks and contains no chat data.
+
+ Expand the section below to see the full list of tasks included in the training data.
+
+ <details>
+ <summary>Data Sources</summary>
+
+ | Task | Source | Languages | Count |
+ |---|---|---|---|
+ | Chat | N/A | N/A | 0 |
+ | Multi-reference Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | mixed | 10000 |
+ | Paraphrase | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | mixed | 3521 |
+ | Named-entity Recognition | [AnCora-Ca-NER](https://huggingface.co/datasets/projecte-aina/ancora-ca-ner) | ca | 12059 |
+ | Named-entity Recognition | [BasqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE), [EusIE](https://huggingface.co/datasets/HiTZ/EusIE) | eu | 4304 |
+ | Named-entity Recognition | [SLI NERC Galician Gold Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora) | gl | 6483 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | pt | 854 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | nl | 800 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | es | 1654 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en | 1671 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | ru | 800 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | it | 858 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | fr | 857 |
+ | Named-entity Recognition | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | de | 1312 |
+ | Terminology-aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-ru | 50 |
+ | Terminology-aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-fr | 29 |
+ | Automatic Post-Editing | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-fr | 6133 |
+ | Automatic Post-Editing | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-nl | 9077 |
+ | Automatic Post-Editing | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-pt | 5762 |
+ | Automatic Post-Editing | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | de-en | 10000 |
+ | Automatic Post-Editing | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-de | 10000 |
+ | Machine Translation Evaluation | TowerBlocks-sample | en-ru, en-pl, ru-en, en-de, en-ru, de-fr, de-en, en-de | 353 |
+ | Machine Translation Evaluation | BSC | four pivot languages (eu, es, ca, gl) paired with European languages (bg, cs, da, de, el, en, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv) | 9700 |
+ | General Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | nl-en, en-ru, it-en, fr-en, es-en, en-fr, ru-en, fr-de, en-nl, de-fr | 500 |
+ | General Machine Translation | BSC | three pivot languages (es, ca, en) paired with European languages (ast, arn, arg, bg, cs, cy, da, de, el, et, fi, ga, gl, hr, it, lt, lv, mt, nb, nn, nl, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk, eu) | 9350 |
+ | Fill-in-the-Blank | BSC | European languages (cs, da, de, el, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv), pivot in ca, es, eu, gl, en | 11500 |
+ | Document-level Translation | BSC | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
+ | Paragraph-level Translation | BSC | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-it | 348 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-ru | 454 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-fr | 369 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-nl | 417 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-es | 431 |
+ | Contextual Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | en-de | 558 |
+
+ </details>
+