Update README.md
README.md (CHANGED)
```diff
@@ -60,7 +60,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -285,7 +285,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 
 
@@ -432,8 +432,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.
-and 1 final
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
```
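The reweighting described in the second hunk (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else kept as-is) can be sketched as below. This is only an illustration of the stated multipliers; the per-source token counts are hypothetical placeholders, not the corpus's actual composition, which is documented in the README's data tables.

```python
# Hypothetical per-source token counts in trillions; these numbers exist only to
# illustrate the reweighting rule described in the README, not the real breakdown.
corpus = {"en": 1.0, "code": 0.5, "es": 0.25, "ca": 0.08, "gl": 0.02, "eu": 0.02, "other": 0.5}

# Multipliers as described in the text: halve English and code, double the
# Spanish co-official languages (es, ca, gl, eu), leave the rest unchanged.
multipliers = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}

reweighted = {src: toks * multipliers.get(src, 1.0) for src, toks in corpus.items()}
print({src: round(toks, 3) for src, toks in reweighted.items()})
print(f"Reweighted total: {sum(reweighted.values()):.3f}T")
```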
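The per-epoch figures in the last hunk are consistent with the new 12.875T total. A minimal arithmetic check, using only the numbers stated in the diff above:

```python
# (number of epochs, tokens per epoch in trillions), as stated in the updated README text
schedule = [
    (3, 2.4),    # initial pre-training epochs on the 2.4T-token mix
    (2, 2.68),   # epochs with the English part of Colossal OSCAR swapped for FineWebEdu
    (1, 0.315),  # final epoch of higher-quality tokens
]

total = sum(n_epochs * tokens for n_epochs, tokens in schedule)
print(f"Total tokens seen during pre-training: {total:.3f}T")  # -> 12.875T
```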