Typos and potential clarification on the use of Colossal OSCAR
#5 · opened by pcuenq (HF staff)

README.md CHANGED
@@ -48,7 +48,7 @@ language:
>
> The weights will be promptly updated as soon as the training process is complete.

-#
+# Salamandra ALIA-40b Model Card

ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.

@@ -282,10 +282,10 @@ for output in outputs:
### Pretraining Data

The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
-The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
+The initial 1.5 training epochs used 2.4 trillion tokens from Colossal OSCAR, obtained by manually adjusting data proportions to balance the representation
and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-
+During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

![lang distrib](./images/corpus_languages.png)
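
To make the mixing arithmetic in the edited paragraph concrete, here is a minimal sketch of the per-subset reweighting it describes (0.5x for English and code, 2x for Spain's co-official languages, 1x for everything else). Only those factors come from the model card; the language codes, helper names, and token counts below are illustrative assumptions, not the actual corpus statistics or the training pipeline's code.

```python
# Minimal sketch of the sampling-weight adjustment described in the
# pretraining-data paragraph: English and code are downsampled to half,
# Spain's co-official languages are oversampled by 2x, and the remaining
# subsets keep their original proportions. Token counts are hypothetical
# placeholders, not the real corpus sizes.

CO_OFFICIAL = {"es", "ca", "gl", "eu"}   # Spanish, Catalan, Galician, Basque
DOWNSAMPLED = {"en", "code"}             # English and programming languages


def sampling_factor(subset: str) -> float:
    """Return the up/downsampling factor applied to a corpus subset."""
    if subset in DOWNSAMPLED:
        return 0.5
    if subset in CO_OFFICIAL:
        return 2.0
    return 1.0


def adjusted_mix(raw_tokens: dict[str, float]) -> dict[str, float]:
    """Apply the per-subset factors and return the resulting token shares."""
    weighted = {s: n * sampling_factor(s) for s, n in raw_tokens.items()}
    total = sum(weighted.values())
    return {s: n / total for s, n in weighted.items()}


if __name__ == "__main__":
    # Hypothetical raw token counts in billions, purely for illustration.
    raw = {"en": 900, "code": 500, "es": 300, "ca": 40,
           "gl": 10, "eu": 5, "fr": 200, "de": 200}
    for subset, share in adjusted_mix(raw).items():
        print(f"{subset}: {share:.1%}")
```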