Text Generation · Transformers · Safetensors · llama · text-generation-inference
jsaizant committed · verified
Commit b3d12ac · 1 Parent(s): 85aaa95

Update README.md

Files changed (1):
  1. README.md +4 -4
README.md CHANGED
@@ -60,7 +60,7 @@ Along with the open weights, all training scripts and configuration files are ma
 
 ### Description
 
-Transformer-based decoder-only language model that has been pre-trained from scratch on 11.675 trillion tokens of highly curated data.
+Transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
@@ -285,7 +285,7 @@ The initial three training epochs used 2.4 trillion tokens, obtained by manually
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
-This adjustment resulted in a total of 2.08 trillion tokens, distributed as outlined below:
+This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
 
@@ -432,8 +432,8 @@ To consult the data summary document with the respective licences, please send a
 </details>
 
 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.08T tokens per epoch;
-and 1 final round of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 11.675 trillion tokens.
+of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.
 
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
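The +4 −4 change is essentially a correction of the token accounting: two epochs at 2.68T tokens (rather than 2.08T), combined with the other stages, sum to roughly 12.875 trillion tokens instead of 11.675 trillion. Below is a minimal sanity check of that arithmetic, using only the figures quoted in the diff above; the variable names are purely illustrative.

```python
# Token accounting from the README, as amended in this commit (all values in trillions of tokens).
initial_epochs = 3      # pre-training epochs on the original data mixture
tokens_initial = 2.4    # tokens per epoch in those epochs

fineweb_epochs = 2      # epochs where the English Colossal OSCAR data was swapped for FineWebEdu
tokens_fineweb = 2.68   # tokens per epoch after the swap (previously stated as 2.08)

tokens_final = 0.315    # final epoch of higher-quality tokens

new_total = initial_epochs * tokens_initial + fineweb_epochs * tokens_fineweb + tokens_final
old_total = initial_epochs * tokens_initial + fineweb_epochs * 2.08 + tokens_final

print(f"updated total:  {new_total:.3f}T")   # 12.875T, matching the new README text
print(f"previous total: {old_total:.3f}T")   # 11.675T, matching the text being replaced
```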