Typos and potential clarification on the use of Colossal OSCAR

#5
opened by pcuenq
Files changed (1)
README.md (+3 −3)
README.md CHANGED

@@ -48,7 +48,7 @@ language:
 >
 > The weights will be promptly updated as soon as the training process is complete.
 
-# Salmandra ALIA-40b Model Card
+# Salamandra ALIA-40b Model Card
 
 ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.
 
@@ -282,10 +282,10 @@ for output in outputs:
 ### Pretraining Data
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
-The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
+The initial 1.5 training epochs used 2.4 trillion tokens from Colossal OSCAR, obtained by manually adjusting data proportion to balance the representation
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-Following, during the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
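For context on the paragraph changed in the second hunk, the reweighting it describes is simple proportional arithmetic: halve code and English, double the co-official languages of Spain, leave the rest alone, then renormalize. The sketch below is purely illustrative and is not the authors' actual data pipeline; the language codes and example shares are hypothetical placeholders.

```python
# Minimal sketch of the corpus reweighting arithmetic described in the README text.
# All proportions below are hypothetical, for illustration only.

CO_OFFICIAL = {"es", "ca", "gl", "eu"}   # assumed codes for Spain's co-official languages
DOWNSAMPLED = {"en", "code"}             # English and code, downsampled to half

def adjust_proportions(original: dict[str, float]) -> dict[str, float]:
    """Apply the 0.5x / 2x / 1x factors, then renormalize so the shares sum to 1."""
    adjusted = {}
    for source, share in original.items():
        if source in DOWNSAMPLED:
            adjusted[source] = share * 0.5
        elif source in CO_OFFICIAL:
            adjusted[source] = share * 2.0
        else:
            adjusted[source] = share
    total = sum(adjusted.values())
    return {source: share / total for source, share in adjusted.items()}

# Hypothetical original shares, not taken from the model card.
example = {"en": 0.40, "code": 0.20, "es": 0.10, "ca": 0.03, "gl": 0.01, "eu": 0.01, "other": 0.25}
print(adjust_proportions(example))
```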