Update README.md
README.md (CHANGED)
```diff
@@ -60,7 +60,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -285,7 +285,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 
 
@@ -432,8 +432,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.
-and 1 final
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
```
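The reweighting described in the second hunk (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else kept as-is) can be sketched as below. This is only an illustration of the stated multipliers; the per-source token counts are hypothetical placeholders, not the corpus's actual composition, which is documented in the README's data tables.

```python
# Hypothetical per-source token counts in trillions; these numbers exist only to
# illustrate the reweighting rule described in the README, not the real breakdown.
corpus = {"en": 1.0, "code": 0.5, "es": 0.25, "ca": 0.08, "gl": 0.02, "eu": 0.02, "other": 0.5}

# Multipliers as described in the text: halve English and code, double the
# Spanish co-official languages (es, ca, gl, eu), leave the rest unchanged.
multipliers = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}

reweighted = {src: toks * multipliers.get(src, 1.0) for src, toks in corpus.items()}
print({src: round(toks, 3) for src, toks in reweighted.items()})
print(f"Reweighted total: {sum(reweighted.values()):.3f}T")
```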
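The per-epoch figures in the last hunk are consistent with the new 12.875T total. A minimal arithmetic check, using only the numbers stated in the diff above:

```python
# (number of epochs, tokens per epoch in trillions), as stated in the updated README text
schedule = [
    (3, 2.4),    # initial pre-training epochs on the 2.4T-token mix
    (2, 2.68),   # epochs with the English part of Colossal OSCAR swapped for FineWebEdu
    (1, 0.315),  # final epoch of higher-quality tokens
]

total = sum(n_epochs * tokens for n_epochs, tokens in schedule)
print(f"Total tokens seen during pre-training: {total:.3f}T")  # -> 12.875T
```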