Update README.md
README.md
CHANGED
@@ -39,6 +39,28 @@ language:
 - sr
 - sv
 - uk
+datasets:
+- oscar-corpus/colossal-oscar-1.0
+- HuggingFaceFW/fineweb-edu
+- joelniklaus/eurlex_resources
+- joelito/legal-mc4
+- projecte-aina/CATalog
+- UFRGS/brwac
+- community-datasets/hrwac
+- danish-foundation-models/danish-gigaword
+- HiTZ/euscrawl
+- PleIAs/French-PD-Newspapers
+- PleIAs/French-PD-Books
+- AI-team-UoA/greek_legal_code
+- HiTZ/latxa-corpus-v1.1
+- allenai/peS2o
+- pile-of-law/pile-of-law
+- PORTULAN/parlamento-pt
+- hoskinson-center/proof-pile
+- togethercomputer/RedPajama-Data-1T
+- bigcode/starcoderdata
+- bjoernp/tagesschau-2018-2023
+- EleutherAI/the_pile_deduplicated
 ---
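The `datasets:` entries added above are Hugging Face Hub dataset IDs, so the model card now links directly to the pre-training corpora. As a quick illustration, any of them can be pulled with the `datasets` library; the sketch below is a minimal example (the `sample-10BT` config name for FineWeb-Edu is an assumption about how that dataset is organized, and some of the other IDs are gated and require accepting their terms first).

```python
# Minimal sketch: stream a few records from one of the corpora listed in the
# model-card metadata above. Requires `pip install datasets`.
from datasets import load_dataset

# "sample-10BT" is an assumed small sample config of FineWeb-Edu; any other ID
# from the `datasets:` list could be substituted here.
ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,  # avoid downloading the full corpus
)

for i, record in enumerate(ds):
    print(record["text"][:200])  # FineWeb-Edu records expose a `text` field
    if i == 2:
        break
```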
@@ -284,13 +306,13 @@ The pre-training corpus comprises data from 35 European languages and 92 program
 The initial three training epochs used 2.4 trillion tokens, obtained by manually adjusting the data proportions to balance the representation
 and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
 Spain's co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
+During the following epochs, the English portion of the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

 The pre-training corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
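The re-balancing described above (code and English halved, Spain's co-official languages doubled, everything else unchanged) amounts to a per-source weighting of token counts before sampling. The sketch below illustrates that bookkeeping only; the token counts are invented for the example and are not the real corpus statistics.

```python
# Toy illustration of the mixture adjustment described above: code and English
# are downsampled to 0.5x, Spain's co-official languages are oversampled 2x,
# and every other source keeps its original proportion. All token counts are
# made up for illustration.
ORIGINAL_TOKENS_B = {   # billions of tokens per source (illustrative values)
    "en": 900.0, "code": 600.0,
    "es": 150.0, "ca": 20.0, "gl": 5.0, "eu": 5.0,
    "de": 200.0, "fr": 180.0,
}

WEIGHTS = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}

adjusted = {src: toks * WEIGHTS.get(src, 1.0) for src, toks in ORIGINAL_TOKENS_B.items()}
total = sum(adjusted.values())

for src, toks in sorted(adjusted.items(), key=lambda kv: -kv[1]):
    print(f"{src:>5}: {toks:7.1f}B tokens ({100 * toks / total:5.2f}%)")
```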
@@ -432,7 +454,7 @@
 </details>

 The model was trained for 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
+of the Colossal OSCAR dataset was replaced with FineWeb-Edu (350BT subset), resulting in 2.68T tokens per epoch;
 and 1 final epoch of 0.315T higher-quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.

 We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
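The 12.875T total quoted above follows directly from the epoch schedule (3 epochs at 2.4T, 2 epochs at 2.68T, and a final 0.315T epoch); a quick check:

```python
# Sanity check of the total token count quoted above, in billions of tokens.
total_billion = 3 * 2400 + 2 * 2680 + 315  # 3 initial + 2 FineWeb-Edu + 1 final epoch
print(total_billion / 1000, "trillion tokens")  # 12.875 trillion tokens
```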
@@ -465,7 +487,7 @@ and public institutions, which can be found in detail in the acknowledgements.
 
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
-This work
+This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
 
 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -1097,4 +1119,4 @@ Technical report coming soon.
 |:---:|:---:|:---:|
 |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
 |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
-|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
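The checkpoints in the table above are regular Hub repositories. The sketch below shows one way to load the smallest one for a quick generation test; it assumes the checkpoints expose the standard `transformers` causal-LM interface and that a bfloat16-capable GPU is available.

```python
# Minimal sketch: load a checkpoint from the table above and generate a short
# continuation. Assumes the standard `transformers` causal-LM interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b"  # smallest entry in the table above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: a bfloat16-capable device is available
    device_map="auto",           # requires `accelerate`
)

inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```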