Typos and potential clarification on the use of Colossal OSCAR

#5
opened by pcuenq
Files changed (1)
README.md (+3 −3)
README.md CHANGED

@@ -48,7 +48,7 @@ language:
 >
 > The weights will be promptly updated as soon as the training process is complete.
 
-# Salmandra ALIA-40b Model Card
+# Salamandra ALIA-40b Model Card
 
 ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.
 
@@ -282,10 +282,10 @@ for output in outputs:
 ### Pretraining Data
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
-The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
+The initial 1.5 training epochs used 2.4 trillion tokens from Colossal OSCAR, obtained by manually adjusting data proportion to balance the representation
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-Following, during the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
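For context on the paragraph changed in the second hunk, the reweighting it describes is simple proportional arithmetic: halve code and English, double the co-official languages of Spain, leave the rest alone, then renormalize. The sketch below is purely illustrative and is not the authors' actual data pipeline; the language codes and example shares are hypothetical placeholders.

```python
# Minimal sketch of the corpus reweighting arithmetic described in the README text.
# All proportions below are hypothetical, for illustration only.

CO_OFFICIAL = {"es", "ca", "gl", "eu"}   # assumed codes for Spain's co-official languages
DOWNSAMPLED = {"en", "code"}             # English and code, downsampled to half

def adjust_proportions(original: dict[str, float]) -> dict[str, float]:
    """Apply the 0.5x / 2x / 1x factors, then renormalize so the shares sum to 1."""
    adjusted = {}
    for source, share in original.items():
        if source in DOWNSAMPLED:
            adjusted[source] = share * 0.5
        elif source in CO_OFFICIAL:
            adjusted[source] = share * 2.0
        else:
            adjusted[source] = share
    total = sum(adjusted.values())
    return {source: share / total for source, share in adjusted.items()}

# Hypothetical original shares, not taken from the model card.
example = {"en": 0.40, "code": 0.20, "es": 0.10, "ca": 0.03, "gl": 0.01, "eu": 0.01, "other": 0.25}
print(adjust_proportions(example))
```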